fix(test-compose): set CERTCTL_AGENT_BOOTSTRAP_TOKEN placeholder (deploy-vendor-e2e job)

deploy-vendor-e2e was hidden behind the go-build-and-test failure; once that cleared (b1ca046), the vendor-e2e job actually booted certctl-test-server for the first time in a while and hit the Sprint 5 ACQ RED-003 fallout:

    Failed to load configuration: phase-2 SEC-H1 fail-closed guard: CERTCTL_AGENT_BOOTSTRAP_TOKEN is empty and CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY=true — refuse to start.

The Sprint 5 RED-003 closure flipped DENY_EMPTY's default from false→true in production code, but the test compose stack never set a token. The fail-closed guard (internal/config/config.go:1054) refuses to start unless one of:

- CERTCTL_AGENT_BOOTSTRAP_TOKEN is non-empty, OR
- CERTCTL_DEMO_MODE_ACK=true (demo-mode override), OR
- CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY=false (warn-mode escape hatch for v2.1.x→v2.2.x upgrade window)

This is the e2e TEST stack with production-like auth posture (CERTCTL_AUTH_TYPE=api-key), not a demo stack. The right fix is the first option: set a deterministic placeholder token. Picking the warn-mode escape hatch would silently test the wrong posture; picking DEMO_MODE_ACK would also flip CERTCTL_AUTH_TYPE expectations.

Also fixed deploy/ENVIRONMENTS.md: the entry still said 'default flip to true scheduled for v2.2.0', which became stale on 2026-05-16 when Sprint 5 ACQ RED-003 actually flipped it. Updated the default column from `false` to `true` and rewrote the description to reflect the current posture + the v2.1.x→v2.2.x warn-mode escape hatch.

Verified locally: all 53 locally-runnable ci-guards still green (4 skipped: H-001-bare-from + H-002-bare-compose-image + digest-validity + no-precompiled-binary, all need docker-registry network). CI re-run on this commit should clear deploy-vendor-e2e's certctl-test-server dependency-failed-to-start step.
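
The three conditions under which the fail-closed guard lets the server start can be sketched in Go roughly as follows. This is an illustrative sketch only: the struct, field names, and function name are hypothetical, and the real code at internal/config/config.go:1054 is not shown in this commit and may be shaped differently.

```go
package main

import (
	"errors"
	"fmt"
)

// bootstrapConfig is a hypothetical holder for the three env vars the
// commit message names; the real config struct is an assumption here.
type bootstrapConfig struct {
	Token       string // CERTCTL_AGENT_BOOTSTRAP_TOKEN
	DenyEmpty   bool   // CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY (default true since RED-003)
	DemoModeAck bool   // CERTCTL_DEMO_MODE_ACK
}

var errEmptyBootstrapToken = errors.New(
	"CERTCTL_AGENT_BOOTSTRAP_TOKEN is empty and CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY=true")

// checkBootstrapToken returns nil when any one of the three conditions
// listed in the commit message holds, and an error otherwise.
func checkBootstrapToken(c bootstrapConfig) error {
	switch {
	case c.Token != "":
		return nil // a token is set
	case c.DemoModeAck:
		return nil // demo-mode override
	case !c.DenyEmpty:
		return nil // warn-mode escape hatch
	}
	return errEmptyBootstrapToken
}

func main() {
	// The test compose stack before this fix: no token, default DENY_EMPTY=true.
	before := bootstrapConfig{DenyEmpty: true}
	fmt.Println("before fix:", checkBootstrapToken(before))

	// After this fix: a deterministic placeholder token (value is made up).
	after := bootstrapConfig{Token: "placeholder-token", DenyEmpty: true}
	fmt.Println("after fix:", checkBootstrapToken(after))
}
```

Setting the token keeps DenyEmpty at its production default, which is why the commit prefers it over the other two branches for the e2e stack.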
fix(deps): go mod tidy — drop unused google.golang.org/genproto bare module (CI go-mod-tidy gate)
2026-06-08 08:28:54 +00:00 · 2026-05-16 23:15:22 +00:00 · 2026-05-16 22:49:19 +00:00 · 2026-05-16 22:49:01 +00:00 · 2026-05-16 22:48:47 +00:00 · 2026-05-16 22:29:56 +00:00
636 changed files with 61121 additions and 9447 deletions
@@ -0,0 +1,118 @@
 # Acquisition-audit DEPL-005 + DATA-012 closure (Sprint 4 ACQ,
 # 2026-05-16). Weekly backup-restore smoke test.
 #
 # Why
 # ===
 # The Helm CronJob at deploy/helm/certctl/templates/backup-cronjob.yaml
 # and the operator runbook at docs/operator/runbooks/postgres-backup.md
 # both document a pg_dump -Fc -based backup strategy, but the dump has
 # never been restored end-to-end under CI. A backup procedure that has
 # never been restore-tested is not a backup procedure. This workflow
 # adds the missing assertion.
 #
 # What
 # ====
 # Each Monday at 07:00 UTC (1h offset from loadtest.yml's 06:00 UTC
 # slot so they don't fight for runners), boot a real Postgres
 # 16-alpine container against the same digest pin as the production
 # deploy/docker-compose.yml, exercise the audit_events hash chain
 # with a small synthetic workload, pg_dump the database, drop the
 # schema, pg_restore, and assert the chain head + row count
 # round-trip byte-for-byte.
 #
 # The chain head round-trip property is the load-bearing assertion.
 # Migration 000047 hashes each audit_events row's canonical payload
 # with `to_char(timestamp AT TIME ZONE 'UTC',
 # 'YYYY-MM-DD"T"HH24:MI:SS.US"Z"')`. Any TIMESTAMPTZ-precision loss
 # in the dump→restore path (a real concern across major Postgres
 # upgrades or with --format=plain) would corrupt the hash. The whole
 # point of testing instead of trusting docs is to PROVE the property
 # under a real workload.
 #
 # Workflow boundaries
 # ===================
 # - Does not exercise PITR / WAL archiving (DR runbook owns that).
 # - Does not exercise the Helm CronJob's S3 sink or scheduling
 #   (operator-side concern, not a property of the dump shape).
 # - Does not deploy or boot the certctl-server itself — the smoke
 #   harness talks to Postgres directly; we're testing the dump,
 #   not the server.
 name: backup-restore-smoke
 on:
  # Manual trigger from the Actions tab — useful before tagging a
  # release that touches the audit_events schema, or after a dep
  # bump that could affect canonical-payload formatting.
  workflow_dispatch:
  schedule:
    # Mondays at 07:00 UTC. Off-peak, off-set 1h from loadtest.yml
    # (06:00 UTC) so the two jobs don't fight for runners on the
    # GitHub-hosted ubuntu-latest pool.
    - cron: '0 7 * * 1'
 # Defense-in-depth: this job reads source and exercises a database;
 # it never needs write access to PRs, branches, releases, or
 # packages. Pin permissions to the minimum.
 permissions:
  contents: read
 jobs:
  backup-restore:
    name: pg_dump / pg_restore smoke
    runs-on: ubuntu-latest
    # 15-minute hard cap. The actual workload + dump + restore + verify
    # cycle runs in well under a minute on a warm runner; 15 minutes
    # absorbs cold image pulls, slow runner provisioning, and the
    # Postgres service-container readiness wait without letting a stuck
    # job consume the runner indefinitely.
    timeout-minutes: 15
    # Postgres service container. Pin to the same digest as
    # deploy/docker-compose.yml so the smoke runs against the exact
    # image the production deploy uses — a regression that surfaces
    # only on a specific Postgres minor bump shows up here on the
    # next image refresh in compose, not silently on a customer site.
    services:
      postgres:
        image: postgres:16-alpine@sha256:890480b08124ce7f79960a9bb16fe39729aa302bd384bfd7c408fee6c8f7adb7
        env:
          POSTGRES_DB: certctl
          POSTGRES_USER: certctl
          POSTGRES_PASSWORD: certctl
        ports:
          - 5432:5432
        # GitHub's services-container health check. The smoke shell
        # also waits for pg_isready as a belt-and-suspenders guard.
        options: >-
          --health-cmd "pg_isready -U certctl -d certctl"
          --health-interval 5s
          --health-timeout 3s
          --health-retries 10
    steps:
      - name: Checkout
        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5  # v4
      - name: Set up Go
        uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff  # v5
        with:
          go-version: '1.25.10'
          # Cache go-build + go-mod for the weekly run. Keep the
          # cache key bound to go.sum so a dep bump invalidates it.
          cache: true
      - name: Run backup-restore smoke
        env:
          PGHOST: 127.0.0.1
          PGPORT: '5432'
          PGUSER: certctl
          PGPASSWORD: certctl
          PGDATABASE: certctl
          # Insert enough rows to exercise the chain over a non-trivial
          # length. 24 ≫ 1 — large enough to surface ordering bugs,
          # small enough that the dump finishes in seconds.
          SMOKE_ROWS: '24'
        run: bash deploy/test/backup-restore-smoke.sh
@@ -132,6 +132,18 @@ jobs:
        run: |
          go test ./internal/service/... ./internal/api/handler/... ./internal/api/middleware/... ./internal/api/router/... ./internal/auth/... ./internal/integration/... ./internal/connector/issuer/... ./internal/connector/target/... ./internal/connector/notifier/... ./internal/connector/discovery/... ./internal/crypto/... ./internal/mcp/... ./internal/cli/... ./internal/domain/... ./internal/validation/... ./internal/tlsprobe/... ./internal/ciparity/... -count=1 -cover -coverprofile=coverage.out
      - name: Multi-replica rate-limit integration test (Phase 13 Sprint 13.2/13.3 — ARCH-M1 closure proof)
        # The falsifiable proof that CERTCTL_RATE_LIMIT_BACKEND=postgres
        # enforces caps cluster-wide. testcontainers-go spins one
        # Postgres container; 3 *PostgresSlidingWindowLimiter instances
        # share it; 100 concurrent Allow("test-key") with cap=10 must
        # see exactly 10 succeed + 90 ErrRateLimited. Failure here =
        # the row-lock arbitration broke; ARCH-M1 closure is invalid.
        run: |
          go test -tags=integration -race -count=1 -timeout=300s \
              -run TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas \
              ./internal/integration/...
      - name: Check Coverage Thresholds
        # ci-pipeline-cleanup Phase 2: per-package floors moved to
        # .github/coverage-thresholds.yml. Each entry has `floor:` +
@@ -176,6 +188,15 @@ jobs:
      # 167 legitimate tests for no observable behavior change. The
      # Test<Func>_<Scenario>_<ExpectedResult> form remains the
      # recommended pattern for parameterized scenarios, but is not gated.
      # Phase 4 DEPL-* prerequisite (2026-05-14): helm-templates-lint.sh
      # needs the `helm` CLI on PATH to run helm lint + helm template
      # against the chart. The official azure/setup-helm action installs
      # a SHA-pinned helm binary into the runner.
      - name: Install Helm (for helm-templates-lint guard)
        uses: azure/setup-helm@b9e51907a09c216f16ebe8536097933489208112  # v4.3.0
        with:
          version: v3.16.0
      - name: Regression guards (extracted to scripts/ci-guards/)
        # All named regression guards live at scripts/ci-guards/<id>.sh per
        # ci-pipeline-cleanup bundle Phase 1. Each guard is callable locally:
@@ -403,6 +424,15 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5  # v4
        with:
          # ARCH-001-A closure (Sprint 5, 2026-05-16). The
          # openapi-version-tag-parity guard needs the v* tags to
          # be present locally so it can confirm openapi.yaml's
          # info.version matches the latest release. Without
          # fetch-tags, the guard falls back to the GitHub API —
          # works but adds a network round-trip per CI run.
          fetch-tags: true
          fetch-depth: 0
      - name: Set up Node.js
        uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020  # v4
@@ -436,6 +466,17 @@ jobs:
        working-directory: web
        run: npx vite build
      - name: Frontend bundle-size budget (size-limit)
        # Acquisition-audit SCALE-007 closure (Sprint 6 ACQ, 2026-05-16).
        # Per-chunk + per-tier budgets in web/.size-limit.json; brotli-
        # compressed sizes match real-world download cost. A regression
        # that bloats a chunk past its cap fails this step and forces
        # an explicit operator decision (fix vs raise cap with rationale).
        # The script wrapper at scripts/ci-guards/G-frontend-bundle-budget.sh
        # is the local-runnable counterpart; both invoke `npm run size`.
        working-directory: web
        run: npm run size
      - name: Regression guards (extracted to scripts/ci-guards/)
        # All named regression guards live at scripts/ci-guards/<id>.sh per
        # ci-pipeline-cleanup bundle Phase 1. Each guard is callable locally:
@@ -0,0 +1,112 @@
 # Phase 8 closure (TEST-H1 + TEST-H2): browser-driven E2E + visual
 # regression.
 #
 # TEST-003 closure (Sprint 5, 2026-05-16): the suite has accumulated
 # the empirical green-run evidence the Phase 8 prompt required. 14
 # consecutive green runs across 2026-05-14 to 2026-05-15 (sampled
 # via api.github.com/repos/certctl-io/certctl/actions/runs) during
 # heavy Sprint 1-4 frontend churn confirm stability. The job is
 # now part of the merge gate (continue-on-error: false below).
 #
 # Operator action still required AFTER this commit pushes:
 #   - Add this job's "id" to the branch-protection required-checks
 #     list at https://github.com/certctl-io/certctl/settings/branches.
 #     Without that, the workflow's failure-blocks-merge contract
 #     only fires on PRs whose author is configured to honour the
 #     status check; configured required-checks make it universal.
 #
 # Visual regression: the 04-visual-regression.spec.ts file uses
 # Playwright `toHaveScreenshot()`. First-run on a new branch
 # regenerates baselines via the `--update-snapshots` flag; the
 # operator commits the resulting PNG bytes to git. Subsequent runs
 # pixel-diff. The dispatch input below provides an explicit knob
 # for that initial baseline pass without needing to edit the
 # workflow file. See docs/operator/runbooks/e2e-snapshot-update.md
 # for the snapshot-bump workflow.
 name: Frontend E2E
 on:
  push:
    branches: [master]
    paths:
      - 'web/**'
      - '.github/workflows/e2e.yml'
  pull_request:
    paths:
      - 'web/**'
      - '.github/workflows/e2e.yml'
  workflow_dispatch:
    inputs:
      update_snapshots:
        description: 'Regenerate visual-regression baselines (use sparingly)'
        type: boolean
        default: false
 permissions:
  contents: read
 jobs:
  e2e:
    name: Playwright E2E + visual regression
    runs-on: ubuntu-latest
    # TEST-003 closure (Sprint 5, 2026-05-16): flipped from
    # continue-on-error: true after 14 consecutive green runs across
    # 2026-05-14 to 2026-05-15 confirmed stability. Failures here
    # now fail the workflow, which (combined with the branch
    # protection update the operator owns post-merge) blocks merge.
    continue-on-error: false
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5  # v4
      - name: Set up Node.js
        uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020  # v4
        with:
          node-version: '22'
      - name: Install Dependencies
        working-directory: web
        run: npm ci
      - name: Install Playwright browsers
        working-directory: web
        # --with-deps installs OS packages (libnss3, libatk1.0-0, etc.)
        # the chromium browser needs. Skipping this is the #1 source
        # of "tests pass locally but fail on CI" for new Playwright
        # users. The browser binary downloads to ~/.cache/ms-playwright;
        # the actions/setup-node cache key does NOT include it, so each
        # CI run re-downloads. Add an actions/cache step targeting
        # ~/.cache/ms-playwright keyed by the @playwright/test version
        # in package-lock.json once the suite is stable.
        run: npx playwright install --with-deps chromium
      - name: Run Playwright E2E + visual regression
        working-directory: web
        # The webServer block in playwright.config.ts boots `npm run dev`
        # automatically and waits for http://localhost:5173 to be
        # responsive before the first test fires. No separate "start
        # server" step needed.
        run: |
          if [[ "${{ github.event.inputs.update_snapshots }}" == "true" ]]; then
            echo "::warning::Regenerating visual-regression baselines"
            npx playwright test --update-snapshots
          else
            npx playwright test
          fi
      - name: Upload Playwright report on failure
        if: failure()
        uses: actions/upload-artifact@b4b15b8c7c6ac21ea08fcf65892d2ee8f75cf882  # v4
        with:
          name: playwright-report
          path: web/playwright-report/
          retention-days: 7
      - name: Upload visual-regression diffs on failure
        if: failure()
        uses: actions/upload-artifact@b4b15b8c7c6ac21ea08fcf65892d2ee8f75cf882  # v4
        with:
          name: visual-regression-diffs
          path: web/test-results/
          retention-days: 7
@@ -75,3 +75,65 @@ jobs:
          name: k6-summary-${{ github.run_id }}
          path: deploy/test/loadtest/results/
          retention-days: 90
  # ---------------------------------------------------------------------------
  # Phase 8 SCALE-H2 — scale-tier scenarios. Three new k6 drivers:
  #   - bulk-renewal: 10K-cert seed + criteria-mode POST /bulk-renew
  #   - acme-burst:   200 concurrent VUs against directory/nonce/ARI
  #   - agent-storm:  5K-agent seed + 167 heartbeats/sec sustained
  #
  # Matrix dispatch so each scenario runs on its own runner and a
  # regression in one doesn't mask another. The matrix runs in parallel,
  # which keeps total wall time around the existing 25-minute cap rather
  # than ~70 minutes serialised. Each scenario brings up the full
  # loadtest compose stack independently — there's no shared state
  # between scenarios that would benefit from a single-runner serial
  # invocation.
  #
  # Cadence: same as the API + connector tier job above (workflow_dispatch
  # + Mondays 06:00 UTC). The scale scenarios DO produce useful per-PR
  # signal in theory, but the per-run cost (image build + 5min run × 3)
  # is too high to gate on every PR; weekly is the right trade-off.
  # ---------------------------------------------------------------------------
  k6-scale:
    name: k6 scale tier (${{ matrix.scenario }})
    runs-on: ubuntu-latest
    timeout-minutes: 25
    needs: k6
    strategy:
      # Parallel: a failure in one scenario shouldn't cancel the others.
      # Each scenario's threshold breach is independent diagnostic data.
      fail-fast: false
      matrix:
        scenario:
          - bulk-renewal
          - acme-burst
          - agent-storm
    steps:
      - name: Checkout
        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5  # v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@8d2750c68a42422c14e847fe6c8ac0403b4cbd6f  # v3
      - name: Run scale loadtest (${{ matrix.scenario }})
        env:
          BUILDKIT_PROGRESS: plain
        run: |
          case "${{ matrix.scenario }}" in
            bulk-renewal) make loadtest-scale-bulk ;;
            acme-burst)   make loadtest-scale-acme ;;
            agent-storm)  make loadtest-scale-agent ;;
            *) echo "::error::unknown scenario ${{ matrix.scenario }}"; exit 1 ;;
          esac
      - name: Upload summary
        if: always()
        uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02  # v4
        with:
          # Per-scenario artifact name so the three matrix runs don't
          # collide on upload.
          name: k6-scale-${{ matrix.scenario }}-${{ github.run_id }}
          path: deploy/test/loadtest/results/
          retention-days: 90
@@ -217,6 +217,19 @@ jobs:
      base64-subjects: "${{ needs.aggregate-checksums.outputs.hashes }}"
      upload-assets: true
      provenance-name: multiple.intoto.jsonl
      # Phase 1 RED-2 compat (2026-05-14): the SLSA reusable workflow's
      # default path downloads a pre-built generator binary from a
      # GitHub *release* of slsa-framework/slsa-github-generator —
      # releases are keyed by tag name (vX.Y.Z), and the workflow
      # rejects SHA-form refs with "Expected ref of the form
      # refs/tags/vX.Y.Z". Phase 1 RED-2 SHA-pinned every Actions
      # uses: line, so the default path errors out. Setting
      # compile-generator: true instead builds the generator from the
      # pinned-SHA source inside the workflow run — preserves
      # supply-chain integrity (SHA pin retained), adds ~1 min build
      # time. This is the SLSA project's documented escape hatch for
      # SHA-pinned reusable-workflow consumers.
      compile-generator: true
  # ----------------------------------------------------------------------
  # build-and-push-docker: push container images to GHCR with native
@@ -10,6 +10,7 @@ bin/
 # Frontend
 web/node_modules/
 web/dist/
 web/.storybook-static/
 # Test binary, built with `go test -c`
 *.test
@@ -46,6 +46,29 @@
  manually. Production deploys: this guard is irrelevant
  (`CERTCTL_DEMO_MODE_ACK` should not be set in production).
 ### Fixed
 - **GitHub #13 / Hotfix #19 — GUI "Something went wrong" after browser
  refresh on a real (non-demo) install.** Refresh-after-login wipes the
  in-memory `apiKey` (deliberate — the GUI never persists it to
  localStorage as a security posture). The next API call returns a
  bare 401 with no `WWW-Authenticate` header. Pre-Hotfix-19 the
  AuthProvider 401 handler only hard-navigated to `/login` when `cause`
  was a recognised OIDC session-expiry category (`idle_timeout` /
  `absolute_timeout` / `back_channel_revoked`); bare 401s
  (`cause === ''`) and `invalid_token` causes fell through to an
  in-place `AuthGate` state flip that unmounted `BrowserRouter` under
  an in-flight `<Link>`, triggering a `react-router-dom` invariant
  that surfaced via `ErrorBoundary` as the "Something went wrong"
  screen. **Fix:** every 401 now hard-navigates to `/login` regardless
  of cause; the cause-aware UX is preserved by forwarding
  `?session_expired=<cause>` only when cause is non-empty (bare 401s
  redirect to plain `/login`). Three-line change in
  `web/src/components/AuthProvider.tsx`; 4 regression tests added to
  `AuthProvider.test.tsx` (empty cause from `/targets`, `invalid_token`
  cause, `idle_timeout` cause, already-on-`/login` no-op guard).
  Closes #13.
 ### Security
 - **Alg-downgrade defense relaxed for Keycloak-shape IdPs (v2.1.0 pre-tag fix).**
@@ -1,4 +1,4 @@
-.PHONY: help build run test lint verify verify-deploy loadtest acme-cert-manager-test acme-rfc-conformance-test keycloak-integration-test okta-smoke-test benchmark-auth benchmark-auth-coldcache clean docker-up docker-down migrate-up migrate-down generate test-cover frontend-build e2e-test qa-stats
+.PHONY: help build run test lint verify verify-deploy loadtest loadtest-scale loadtest-scale-bulk loadtest-scale-acme loadtest-scale-agent acme-cert-manager-test acme-rfc-conformance-test keycloak-integration-test okta-smoke-test benchmark-auth benchmark-auth-coldcache clean docker-up docker-down migrate-up migrate-down generate test-cover frontend-build e2e-test qa-stats
 # Default target - show help
 help:
@@ -153,6 +153,49 @@ loadtest:
 	@echo "==> results landed in deploy/test/loadtest/results/"
 	@if [ -f deploy/test/loadtest/results/summary.txt ]; then cat deploy/test/loadtest/results/summary.txt; fi
 # Phase 8 SCALE-H2 — scale-tier load tests. Profile-gated in the
 # loadtest compose so the default `make loadtest` stays fast and
 # focused on the per-PR regression scope (API tier + connector tier).
 #
 # loadtest-scale-bulk runs the 10K-cert bulk-renew scenario.
 # loadtest-scale-acme runs the 200-VU ACME directory/nonce/ARI burst.
 # loadtest-scale-agent runs the 5K-agent heartbeat storm.
 #
 # Each target uses --exit-code-from <scenario-driver> so a threshold
 # breach surfaces as a non-zero make exit. The scale-seed init runs
 # once per invocation (idempotent via ON CONFLICT) so re-running a
 # target against the same compose stack is fine.
 loadtest-scale-bulk:
 	@echo "==> Phase 8 SCALE-H2: bulk-renewal scenario (10K cert fixture, ~6m)"
 	@cd deploy/test/loadtest && docker compose --profile scale up --build \
 	  --abort-on-container-exit --exit-code-from k6-scale-bulk
 	@echo ""
 	@echo "==> results: deploy/test/loadtest/results/summary-bulk-renewal.{json,txt}"
 	@if [ -f deploy/test/loadtest/results/summary-bulk-renewal.txt ]; then \
 	  cat deploy/test/loadtest/results/summary-bulk-renewal.txt; fi
 loadtest-scale-acme:
 	@echo "==> Phase 8 SCALE-H2: ACME enrollment burst (200 VU, ~6m)"
 	@cd deploy/test/loadtest && docker compose --profile scale up --build \
 	  --abort-on-container-exit --exit-code-from k6-scale-acme
 	@echo ""
 	@echo "==> results: deploy/test/loadtest/results/summary-acme-burst.{json,txt}"
 	@if [ -f deploy/test/loadtest/results/summary-acme-burst.txt ]; then \
 	  cat deploy/test/loadtest/results/summary-acme-burst.txt; fi
 loadtest-scale-agent:
 	@echo "==> Phase 8 SCALE-H2: agent heartbeat storm (5K agent fixture, ~6m)"
 	@cd deploy/test/loadtest && docker compose --profile scale up --build \
 	  --abort-on-container-exit --exit-code-from k6-scale-agent
 	@echo ""
 	@echo "==> results: deploy/test/loadtest/results/summary-agent-storm.{json,txt}"
 	@if [ -f deploy/test/loadtest/results/summary-agent-storm.txt ]; then \
 	  cat deploy/test/loadtest/results/summary-agent-storm.txt; fi
 # All three Phase 8 scenarios serially. Use the matrix in
 # .github/workflows/loadtest.yml for parallel CI runs.
 loadtest-scale: loadtest-scale-bulk loadtest-scale-acme loadtest-scale-agent
 # Auth Bundle 2 Phase 10 — Keycloak end-to-end OIDC integration test.
 # Boots a Keycloak container via testcontainers-go (quay.io/keycloak:25.0),
 # imports a canned realm with two groups + two users, and drives the
@@ -9,7 +9,7 @@
 [![GitHub Release](https://img.shields.io/github/v/release/certctl-io/certctl)](https://github.com/certctl-io/certctl/releases)
 [![GitHub Stars](https://img.shields.io/github/stars/certctl-io/certctl?style=flat&logo=github)](https://github.com/certctl-io/certctl/stargazers)
-certctl is a self-hosted platform that automates the entire TLS certificate lifecycle, from issuance through renewal to deployment, with zero human intervention. Twelve native CA connectors plus an OpenSSL / shell-script adapter for custom CAs; fifteen native deployment-target connectors plus a proxy-agent pattern for network appliances and agentless targets. Private keys stay on your infrastructure where they belong. Free, source-available under BSL 1.1, covers the same lifecycle that enterprise platforms charge $100K+/year for.
+certctl is a self-hosted platform that automates the entire TLS certificate lifecycle, from issuance through renewal to deployment, with zero human intervention. Twelve native CA connectors plus an OpenSSL / shell-script adapter for custom CAs; fourteen production-ready native deployment-target connectors plus Kubernetes Secrets (preview) and a proxy-agent pattern for network appliances and agentless targets. In agent-mode (the default), private keys stay on the host they were generated on and never touch the control plane; a demo-only `CERTCTL_KEYGEN_MODE=server` flag mints keys server-side and refuses to start without an explicit `CERTCTL_DEMO_MODE_ACK=true` acknowledgement. Free, source-available under BSL 1.1, covers the same lifecycle that enterprise platforms charge $100K+/year for.
 The CA/Browser Forum's [Ballot SC-081v3](https://cabforum.org/2025/04/11/ballot-sc081v3-introduce-schedule-of-reducing-validity-and-data-reuse-periods/) caps public TLS certificates at **200 days by March 2026**, **100 days by 2027**, and **47 days by 2029**. At 47-day lifespans, a team managing 100 certificates is processing 7+ renewals per week, every week, forever. Manual workflows stop being a choice.
@@ -64,7 +64,7 @@ Built for **platform engineering and DevOps teams** managing 10 to 500+ certific
 certctl handles the full certificate lifecycle in one self-hosted control plane:
 - **Issue and renew** from any CA. Let's Encrypt and any ACME provider, an embedded ACME server you can point cert-manager / certbot / lego at directly, a built-in local CA with sub-CA mode (chains under your enterprise root like ADCS), step-ca, Vault PKI, EJBCA, AWS ACM PCA, Google CAS, DigiCert, Sectigo, GlobalSign, Entrust, plus an OpenSSL / shell-script adapter for anything custom. Twelve native issuer connectors. See the [connector reference](docs/reference/connectors/index.md).
- **Deploy automatically** to NGINX, Apache, HAProxy, Caddy, Traefik, Envoy, IIS, Windows Cert Store, Java keystore, Kubernetes Secrets, AWS ACM, Azure Key Vault, SSH known-hosts, Postfix + Dovecot, F5 BIG-IP. Fifteen native target connectors. File-based targets share an atomic-write + SHA-256 idempotency + on-failure rollback + per-target Prometheus counters primitive (the `deploy.Apply` path covers 12 of 13 file-based connectors). Cloud / API targets (AWS ACM, Azure Key Vault) use vendor-SDK semantics rather than the file primitive; F5 uses iControl REST transactions; Kubernetes Secrets is preview. For the per-target guarantee matrix, see [`docs/reference/deployment-model.md`](docs/reference/deployment-model.md). The reload / validate commands operators configure for shell-using targets (NGINX, Apache, HAProxy, Postfix, JavaKeystore, SSH) are validated server-side AND agent-side against shell-metacharacter injection before execution (see [`internal/connector/target/configcheck`](internal/connector/target/configcheck)).
+- **Deploy automatically** to NGINX, Apache, HAProxy, Caddy, Traefik, Envoy, IIS, Windows Cert Store, Java keystore, AWS ACM, Azure Key Vault, SSH known-hosts, Postfix + Dovecot, F5 BIG-IP. **Fourteen production-ready native target connectors plus Kubernetes Secrets (preview).** File-based targets share an atomic-write + SHA-256 idempotency + on-failure rollback + per-target Prometheus counters primitive (the `deploy.Apply` path covers 12 of 13 file-based connectors). Cloud / API targets (AWS ACM, Azure Key Vault) use vendor-SDK semantics rather than the file primitive; F5 uses iControl REST transactions. The Kubernetes Secrets connector is shipped as preview because the production `client-go` integration is incomplete — see [`docs/reference/deployment-model.md`](docs/reference/deployment-model.md) for the per-target guarantee matrix. The reload / validate commands operators configure for shell-using targets (NGINX, Apache, HAProxy, Postfix, JavaKeystore, SSH) are validated server-side AND agent-side against shell-metacharacter injection before execution (see [`internal/connector/target/configcheck`](internal/connector/target/configcheck)).
 - **Run as an ACME server** so existing client tooling plugs in directly. RFC 8555 + RFC 9773 ARI, two per-profile auth modes (public-trust-style validation or trust_authenticated for internal PKI), doubly-signed key rollover, revoke-cert on both kid path and jwk path, per-account rate limiting. Cert-manager / certbot / lego all work pointed at it. See [`docs/reference/protocols/acme-server.md`](docs/reference/protocols/acme-server.md).
 - **Run as a SCEP server** for Microsoft Intune-managed phones, ChromeOS devices, network appliances. RFC 8894 native with full PKIMessage wire format, native Intune challenge dispatch with replay protection, per-profile dispatch with separate RA cert per profile. See [`docs/reference/protocols/scep-server.md`](docs/reference/protocols/scep-server.md).
 - **Run as an EST server** for HTTPS-based PKCS#10 enrollment. 802.1X / Wi-Fi authentication, IoT device enrollment, RFC 9266 channel binding. See [`docs/reference/protocols/est.md`](docs/reference/protocols/est.md).
@@ -75,11 +75,11 @@ certctl handles the full certificate lifecycle in one self-hosted control plane:
 - **Discover** existing certs across your fleet via filesystem scanning on agents, network TLS probing across CIDR ranges, and cloud secret manager imports (AWS Secrets Manager, Azure Key Vault, GCP Secret Manager). Triage workflow for claim / dismiss / investigate.
 - **Revoke** with full RFC 5280 reason codes, DER CRL generation per issuer (scheduler-pre-generated and ETag-cached), and an embedded RFC 6960 OCSP responder with dedicated per-issuer responder certs. Single + bulk revocation. See [`docs/reference/protocols/crl-ocsp.md`](docs/reference/protocols/crl-ocsp.md).
 - **Alert** via Slack, Microsoft Teams, PagerDuty, OpsGenie, email, webhooks. Per-policy multi-channel routing matrix with severity tiers and fault-isolating per-channel dispatch. See [`docs/operator/runbooks/expiry-alerts.md`](docs/operator/runbooks/expiry-alerts.md).
- **Drive the platform from natural language** via the bundled MCP (Model Context Protocol) server. The full REST API is exposed as MCP tools — ask your AI client "show me all expiring certificates", "revoke the VPN cert, key compromised", or "what agents are offline?" and it translates to API calls. Stateless stdio-transport binary at `cmd/mcp-server/`; same auth as the REST API; no extra attack surface. See [`docs/reference/mcp.md`](docs/reference/mcp.md).
+- **Drive the platform from natural language** via the bundled MCP (Model Context Protocol) server. The bulk of the REST API surface is exposed as MCP tools — ask your AI client "show me all expiring certificates", "revoke the VPN cert, key compromised", or "what agents are offline?" and it translates to API calls. Stateless stdio-transport binary at `cmd/mcp-server/`; same auth as the REST API; no extra attack surface. MCP-vs-REST parity (162 tools covering 221 routes; the gap is a small allowlist of streaming + protocol-conformance endpoints that don't fit the request-response tool shape) is tracked in [`docs/reference/mcp-coverage.md`](docs/reference/mcp-coverage.md) with a CI guard that fails the build if a new REST route lands without either an MCP tool or an explicit allowlist entry. See [`docs/reference/mcp.md`](docs/reference/mcp.md).
 ## Architecture and security
-Go 1.25 control plane with handler → service → repository layering. PostgreSQL 16 backend with idempotent migrations. Pull-only deployment model — the server never initiates outbound connections. Agents poll for work and generate ECDSA P-256 keys locally so private keys never touch the control plane. For network appliances and agentless servers, a proxy agent in the same network zone handles deployment via the target's API (WinRM, iControl REST, SSH/SFTP). See the [Architecture Guide](docs/reference/architecture.md) for full system diagrams.
+Go 1.25 control plane with handler → service → repository layering. PostgreSQL 16 backend with idempotent migrations. Pull-only deployment model — the server never initiates outbound connections. **In agent-keygen mode (the production default), agents poll for work and generate ECDSA P-256 keys locally, so private keys never touch the control plane.** The opposite path (`CERTCTL_KEYGEN_MODE=server`) is demo-only and refuses to boot in production without an explicit `CERTCTL_DEMO_MODE_ACK=true` acknowledgement. For network appliances and agentless servers, a proxy agent in the same network zone handles deployment via the target's API (WinRM, iControl REST, SSH/SFTP). See the [Architecture Guide](docs/reference/architecture.md) for full system diagrams.
 Security: three authentication paths: API keys (SHA-256 hashed + constant-time compared), [OIDC SSO](docs/operator/oidc-runbooks/index.md) (Keycloak / Authentik / Okta / Auth0 / Entra ID / Google Workspace), and Argon2id [break-glass admin](docs/operator/security.md) for SSO-outage recovery. Successful OIDC login mints an HMAC-signed server-side session with `__Host-` cookies, CSRF rotation on every privileged write, and [OIDC Back-Channel Logout 1.0](docs/reference/auth-standards-implemented.md) for IdP-driven session revocation. Role-based authorization on every gated handler with global / per-profile / per-issuer scope. Auditor split keeps regulator-class actors strictly read-only on the audit trail. Day-0 admin via a one-shot bootstrap token; granting or revoking roles requires the dedicated `auth.role.assign` permission. CORS deny-by-default. Shell injection prevention on all connector scripts. SSRF protection (reserved IP filtering) on the network scanner. Issuer + target + OIDC client_secret credentials encrypted at rest with AES-256-GCM. HTTPS-only control plane with TLS 1.3 pinned and a fail-closed startup gate that refuses to boot if the TLS bundle is unusable. Every API call recorded to an immutable audit trail with actor attribution, body hash, and latency tracking. CI runs race detection, static analysis, and vulnerability scanning on every commit. See [`docs/operator/security.md`](docs/operator/security.md) for the full posture and [`docs/operator/auth-threat-model.md`](docs/operator/auth-threat-model.md) for what's defended vs deferred.
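
As an illustration of the "SHA-256 hashed + constant-time compared" API key check mentioned above, here is a minimal sketch using only the Go standard library. `hashKey` and `keyMatches` are hypothetical names chosen for this example; the repository's actual helpers and storage format may differ.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
)

// hashKey returns the SHA-256 digest of an API key. Only the digest
// would be stored server-side (hypothetical helper).
func hashKey(key string) [32]byte {
	return sha256.Sum256([]byte(key))
}

// keyMatches hashes the presented key and compares it to the stored
// digest in constant time, so the comparison does not leak how many
// leading bytes matched.
func keyMatches(presented string, stored [32]byte) bool {
	sum := hashKey(presented)
	return subtle.ConstantTimeCompare(sum[:], stored[:]) == 1
}

func main() {
	stored := hashKey("example-api-key") // placeholder value, not a real key
	fmt.Println("correct key accepted:", keyMatches("example-api-key", stored))
	fmt.Println("wrong key accepted:  ", keyMatches("wrong-key", stored))
}
```

Hashing before comparing means both inputs to `subtle.ConstantTimeCompare` always have the same length, which is the condition under which that function runs in constant time.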
@@ -92,10 +92,12 @@ Security: three authentication paths — API keys (SHA-256 hashed + constant-tim
 ```bash
 git clone https://github.com/certctl-io/certctl.git
 cd certctl
-docker compose -f deploy/docker-compose.yml -f deploy/docker-compose.demo.yml up -d --build
+./deploy/demo-up.sh -d --build
 ```
-Wait ~30 seconds, then open **https://localhost:8443** in your browser. The demo overlay flips the base into demo-mode auth (every request served as the synthetic admin actor `actor-demo-anon` — the server emits a prominent ⚠ DEMO MODE banner at boot reminding you this posture is for evaluation only) and seeds 180 days of realistic history across 13 issuers, 8 agents, managed + discovered certs, jobs, deploys, audit, and notification events. The `certctl-tls-init` init container self-signs an ECDSA-P256 cert on first boot — accept the browser warning for the demo, or feed the generated `ca.crt` to your client.
+Wait ~30 seconds, then open **https://localhost:8443** in your browser. The `demo-up.sh` wrapper exports a fresh `CERTCTL_DEMO_MODE_ACK_TS=$(date +%s)` and forwards the remaining args to `docker compose -f docker-compose.yml -f docker-compose.demo.yml up`. The timestamp export is required by the Phase 2 SEC-H3 fail-closed guard in `internal/config/config.go::Validate` — demo deploys must re-ACK every 24h so a forgotten demo container never silently ends up serving production traffic with `auth-type=none`. The bare `docker compose ... up` command without the timestamp refuses to boot; the wrapper script is the supported entry point.
 The demo overlay flips the base into demo-mode auth (every request served as the synthetic admin actor `actor-demo-anon` — the server emits a prominent ⚠ DEMO MODE banner at boot reminding you this posture is for evaluation only) and seeds 180 days of realistic history across 13 issuers, 8 agents, managed + discovered certs, jobs, deploys, audit, and notification events. The `certctl-tls-init` init container self-signs an ECDSA-P256 cert on first boot — accept the browser warning for the demo, or feed the generated `ca.crt` to your client.
 **Production path — `.env` required, fail-closed on placeholders:**
@@ -0,0 +1 @@
 0
@@ -1,48 +1,100 @@
 # Routes registered in internal/api/router/router.go that are intentionally
-# NOT in api/openapi.yaml. Each entry needs a one-line `why:` justification.
+# NOT in api/openapi.yaml. Each entry needs a one-line `why:` justification
 # AND a required `category:` field (added in Phase 13 Sprint 13.1,
 # 2026-05-14, architecture diligence audit ARCH-H1).
 #
 # Adding a new entry requires PR-time review.
 #
 # OpenAPI-shaped REST endpoints belong in api/openapi.yaml, NOT here.
-# This list is for protocol-shaped (SCEP wire endpoints) and operational
+# This list is for protocol-shaped (SCEP/ACME/EST wire endpoints) and
-# (health, metrics, pprof) routes only.
+# operational (health, metrics, pprof) routes only.
 #
 # Per ci-pipeline-cleanup bundle Phase 9 / frozen decision 0.11.
 #
-# Phase 5 reconciliation (2026-05-13, architecture diligence audit
+# ──────────────────────────────────────────────────────────────────────
-# ARCH-H1): of the 64 entries below, 35 are legitimate wire-protocol
+# The two-bucket contract (Phase 13 Sprint 13.1)
-# carve-outs (SCEP RFC 8894 = 8 entries, ACME RFC 8555 default + per-
+# ──────────────────────────────────────────────────────────────────────
 # profile = 27 entries) that MUST stay. The remaining 29 are REST-
 # shaped routes whose OpenAPI ops were deferred during their original
 # Bundle 2 / audit-2026-05-10 / 2026-05-11 work. Burn-down plan:
 #
-#   Sprint A (per-cluster, ~7-8 ops each):
+#   category: wire-protocol
-#     Cluster 1: auth/sessions + auth/oidc (12 ops)
+#     The route's wire shape is dictated by an IETF RFC (SCEP RFC 8894,
-#     Cluster 2: auth/breakglass + auth/users + auth/runtime-config (8 ops)
+#     ACME RFC 8555, ACME ARI RFC 9773, EST RFC 7030) or it's a
-#     Cluster 3: audit/export + demo-residual/cleanup + auth/logout +
+#     sibling/shorthand variant of such a route (same wire semantics,
-#                auth/breakglass/login + auth/oidc/{login,callback,bcl} (9 ops)
+#     different cosmetic path — e.g. trailing-slash forms, default-
 #     profile shorthands). Documenting these as REST operations in
 #     openapi.yaml would duplicate the RFC with no information gain;
 #     the canonical operator references live in docs/acme-server.md +
 #     docs/operator/scep.md + docs/operator/est.md. These entries
 #     NEVER burn down — they're protocol contracts, not gaps.
 #
 #   category: rest-deferred
 #     The route is REST-shaped (resource CRUD, JSON request/response,
 #     RBAC-gated) but its OpenAPI operation was deferred when the
 #     handler shipped. These MUST monotonically decrease to zero.
 #     Phase 13 Sprints 13.4-13.6 author the OpenAPI ops + delete the
 #     corresponding exception entries; the
 #     openapi-rest-deferred-monotonic.sh CI guard fails any PR that
 #     grows the rest-deferred bucket vs the checked-in baseline at
 #     api/openapi-handler-exceptions-baseline.txt.
 #
 # ──────────────────────────────────────────────────────────────────────
 # Phase 13 Sprint 13.1 categorization (2026-05-14)
 # ──────────────────────────────────────────────────────────────────────
 #
 # Current split, re-derived by the parity script's bucket-reporting
 # subcommand (post-Sprint-13.6 / 2026-05-14):
 #
 #   total entries:           36
 #   wire-protocol:           36
 #   rest-deferred:           0    ← THE FLOOR — ARCH-H1 substantive close
 #
 # Burn-down progress:
 #
 #   Sprint 13.4 SHIPPED — 28 - 13 = 15 (auth/sessions cluster 3 ops +
 #                               auth/oidc CRUD + JWKS + test + refresh
 #                               + group-mappings cluster, 10 ops)
 #   Sprint 13.5 SHIPPED — 15 -  8 =  7 (auth/breakglass admin 4 ops +
 #                               auth/users 3 ops + auth/runtime-config
 #                               1 op, 8 ops total)
 #   Sprint 13.6 SHIPPED —  7 -  7 =  0 (audit/export 1 op + demo-
 #                               residual/cleanup 1 op + auth/logout 1 op +
 #                               auth/breakglass/login 1 op + 3 OIDC
 #                               browser-flow endpoints, 7 ops total)
 #
 # Sprint 13.7 next tightens the parity-script's rest-deferred floor
 # from monotonic-decrease to a hard zero-exact pin. After that, any
 # new REST route MUST land with an OpenAPI op or fail CI — no escape
 # hatch via `category: rest-deferred`.
 #
 # Each authored OpenAPI op needs request/response schemas (not
 # placeholders) so the generated client at web/orval.config.ts emits
 # typed signatures. When an op lands, delete the corresponding entry
-# below + bump the openapi-handler-parity.sh expected counts.
+# below + bump api/openapi-handler-exceptions-baseline.txt downward.
 documented_exceptions:
  - route: "GET /scep"
    why: "SCEP wire-protocol endpoint per RFC 8894 §3.1; serves CA certs via GetCACert/GetCACaps query params, NOT a REST resource."
    category: wire-protocol
  - route: "POST /scep"
    why: "SCEP wire-protocol endpoint per RFC 8894 §3.1; receives PKCSReq / RenewalReq PKIMessages, NOT a REST resource."
    category: wire-protocol
  - route: "GET /scep/"
    why: "SCEP wire-protocol endpoint with trailing-slash variant; ChromeOS clients send the trailing-slash form."
    category: wire-protocol
  - route: "POST /scep/"
    why: "SCEP wire-protocol endpoint with trailing-slash variant; ChromeOS clients send the trailing-slash form."
    category: wire-protocol
  - route: "GET /scep-mtls"
    why: "SCEP-mTLS sibling endpoint per ci-pipeline-cleanup-prerequisite EST RFC 7030 hardening Phase 6.5; same wire-protocol semantics, mutually-authenticated TLS variant."
    category: wire-protocol
  - route: "POST /scep-mtls"
    why: "SCEP-mTLS sibling endpoint, POST variant."
    category: wire-protocol
  - route: "GET /scep-mtls/"
    why: "SCEP-mTLS sibling endpoint, trailing-slash variant."
    category: wire-protocol
  - route: "POST /scep-mtls/"
    why: "SCEP-mTLS sibling endpoint, trailing-slash POST variant."
    category: wire-protocol
  # ACME server (RFC 8555 + RFC 9773 ARI) — wire-protocol surface.
  # Like SCEP/EST, ACME is a JWS-signed-JSON wire protocol whose
@@ -54,62 +106,90 @@ documented_exceptions:
  # challenge, cert, key-change, revoke-cert, renewal-info routes land.
  - route: "GET /acme/profile/{id}/directory"
    why: "ACME server RFC 8555 §7.1.1 directory; documented in docs/acme-server.md."
    category: wire-protocol
  - route: "HEAD /acme/profile/{id}/new-nonce"
    why: "ACME server RFC 8555 §7.2 new-nonce; documented in docs/acme-server.md."
    category: wire-protocol
  - route: "GET /acme/profile/{id}/new-nonce"
    why: "ACME server RFC 8555 §7.2 new-nonce GET form; documented in docs/acme-server.md."
    category: wire-protocol
  - route: "POST /acme/profile/{id}/new-account"
    why: "ACME server RFC 8555 §7.3 new-account (JWS jwk); documented in docs/acme-server.md."
    category: wire-protocol
  - route: "POST /acme/profile/{id}/account/{acc_id}"
    why: "ACME server RFC 8555 §7.3.2 + §7.3.6 (JWS kid) account update + deactivation; documented in docs/acme-server.md."
    category: wire-protocol
  - route: "GET /acme/directory"
    why: "ACME server default-profile shorthand; mirrors per-profile when CERTCTL_ACME_SERVER_DEFAULT_PROFILE_ID is set."
    category: wire-protocol
  - route: "HEAD /acme/new-nonce"
    why: "ACME server default-profile shorthand for new-nonce HEAD."
    category: wire-protocol
  - route: "GET /acme/new-nonce"
    why: "ACME server default-profile shorthand for new-nonce GET."
    category: wire-protocol
  - route: "POST /acme/new-account"
    why: "ACME server default-profile shorthand for new-account."
    category: wire-protocol
  - route: "POST /acme/account/{acc_id}"
    why: "ACME server default-profile shorthand for account update + deactivation."
    category: wire-protocol
  # Phase 2 — orders + finalize + authz + cert.
  - route: "POST /acme/profile/{id}/new-order"
    why: "ACME server RFC 8555 §7.4 new-order; documented in docs/acme-server.md."
    category: wire-protocol
  - route: "POST /acme/profile/{id}/order/{ord_id}"
    why: "ACME server RFC 8555 §7.4 order POST-as-GET; documented in docs/acme-server.md."
    category: wire-protocol
  - route: "POST /acme/profile/{id}/order/{ord_id}/finalize"
    why: "ACME server RFC 8555 §7.4 finalize; documented in docs/acme-server.md."
    category: wire-protocol
  - route: "POST /acme/profile/{id}/authz/{authz_id}"
    why: "ACME server RFC 8555 §7.5 authz POST-as-GET; documented in docs/acme-server.md."
    category: wire-protocol
  - route: "POST /acme/profile/{id}/challenge/{chall_id}"
    why: "ACME server RFC 8555 §7.5.1 challenge response; dispatches to Phase 3 validator pool."
    category: wire-protocol
  - route: "POST /acme/profile/{id}/cert/{cert_id}"
    why: "ACME server RFC 8555 §7.4.2 cert download; documented in docs/acme-server.md."
    category: wire-protocol
  - route: "POST /acme/new-order"
    why: "Phase 2 default-profile shorthand for new-order."
    category: wire-protocol
  - route: "POST /acme/order/{ord_id}"
    why: "Phase 2 default-profile shorthand for order POST-as-GET."
    category: wire-protocol
  - route: "POST /acme/order/{ord_id}/finalize"
    why: "Phase 2 default-profile shorthand for finalize."
    category: wire-protocol
  - route: "POST /acme/authz/{authz_id}"
    why: "Phase 2 default-profile shorthand for authz POST-as-GET."
    category: wire-protocol
  - route: "POST /acme/challenge/{chall_id}"
    why: "Phase 3 default-profile shorthand for challenge response."
    category: wire-protocol
  - route: "POST /acme/cert/{cert_id}"
    why: "Phase 2 default-profile shorthand for cert download."
    category: wire-protocol
  - route: "POST /acme/profile/{id}/key-change"
    why: "ACME server RFC 8555 §7.3.5 doubly-signed key rollover; documented in docs/acme-server.md."
    category: wire-protocol
  - route: "POST /acme/profile/{id}/revoke-cert"
    why: "ACME server RFC 8555 §7.6 revoke-cert (kid OR cert-key auth); documented in docs/acme-server.md."
    category: wire-protocol
  - route: "GET /acme/profile/{id}/renewal-info/{cert_id}"
    why: "ACME server RFC 9773 ACME Renewal Information (unauthenticated GET); documented in docs/acme-server.md."
    category: wire-protocol
  - route: "POST /acme/key-change"
    why: "Phase 4 default-profile shorthand for key rollover."
    category: wire-protocol
  - route: "POST /acme/revoke-cert"
    why: "Phase 4 default-profile shorthand for revoke-cert."
    category: wire-protocol
  - route: "GET /acme/renewal-info/{cert_id}"
    why: "Phase 4 default-profile shorthand for ARI."
    category: wire-protocol
  # =============================================================================
  # Auth Bundle 2 + audit-2026-05-10/11 fix bundle — REST endpoints not yet
@@ -119,59 +199,3 @@ documented_exceptions:
  # stays green for the v2.1.0 release tag. Threat model + handler contracts
  # live in docs/operator/{rbac.md,auth-threat-model.md,oidc-runbooks/*}.
  # =============================================================================
  - route: "GET /auth/oidc/login"
    why: "Bundle 2 Phase 5 OIDC login redirect; user-facing 302 with state cookie. OpenAPI rep deferred to pre-2.2.0."
  - route: "GET /auth/oidc/callback"
    why: "Bundle 2 Phase 5 OIDC callback handler; RFC 9700 §4.7.1 + RFC 9207. OpenAPI rep deferred to pre-2.2.0."
  - route: "POST /auth/logout"
    why: "Bundle 2 Phase 5 cookie + CSRF revoker. OpenAPI rep deferred to pre-2.2.0."
  - route: "POST /auth/breakglass/login"
    why: "Bundle 2 Phase 7.5 public break-glass login (auth-bypass, 404 when disabled). OpenAPI rep deferred to pre-2.2.0."
  - route: "POST /auth/oidc/back-channel-logout"
    why: "Bundle 2 Phase 5 OIDC Back-Channel Logout 1.0 endpoint. OpenAPI rep deferred to pre-2.2.0."
  - route: "GET /api/v1/auth/sessions"
    why: "Bundle 2 Phase 5 self/admin session list. OpenAPI rep deferred to pre-2.2.0."
  - route: "DELETE /api/v1/auth/sessions/{id}"
    why: "Bundle 2 Phase 5 session revoke. OpenAPI rep deferred to pre-2.2.0."
  - route: "DELETE /api/v1/auth/sessions"
    why: "Bundle 2 audit-2026-05-10 MED-2/3 revoke-all-except-current."
  - route: "GET /api/v1/auth/oidc/providers"
    why: "Bundle 2 Phase 5 OIDC provider CRUD (list)."
  - route: "POST /api/v1/auth/oidc/providers"
    why: "Bundle 2 Phase 5 OIDC provider CRUD (create)."
  - route: "PUT /api/v1/auth/oidc/providers/{id}"
    why: "Bundle 2 Phase 5 OIDC provider CRUD (update)."
  - route: "DELETE /api/v1/auth/oidc/providers/{id}"
    why: "Bundle 2 Phase 5 OIDC provider CRUD (delete)."
  - route: "POST /api/v1/auth/oidc/providers/{id}/refresh"
    why: "Bundle 2 audit-2026-05-10 MED-7 JWKS hot-refresh."
  - route: "GET /api/v1/auth/oidc/providers/{id}/jwks-status"
    why: "Bundle 2 audit-2026-05-10 MED-7 JWKS health snapshot."
  - route: "POST /api/v1/auth/oidc/test"
    why: "Bundle 2 audit-2026-05-10 MED-5 dry-run discovery + JWKS + alg-downgrade check."
  - route: "GET /api/v1/auth/oidc/group-mappings"
    why: "Bundle 2 Phase 5 group-mapping CRUD (list)."
  - route: "POST /api/v1/auth/oidc/group-mappings"
    why: "Bundle 2 Phase 5 group-mapping CRUD (create)."
  - route: "DELETE /api/v1/auth/oidc/group-mappings/{id}"
    why: "Bundle 2 Phase 5 group-mapping CRUD (delete)."
  - route: "GET /api/v1/auth/breakglass/credentials"
    why: "Bundle 2 Phase 7.5 admin break-glass list (404 when disabled; password hash never on wire)."
  - route: "POST /api/v1/auth/breakglass/credentials"
    why: "Bundle 2 Phase 7.5 admin break-glass set/rotate password."
  - route: "POST /api/v1/auth/breakglass/credentials/{actor_id}/unlock"
    why: "Bundle 2 Phase 7.5 admin break-glass unlock after lockout."
  - route: "DELETE /api/v1/auth/breakglass/credentials/{actor_id}"
    why: "Bundle 2 Phase 7.5 admin break-glass credential delete."
  - route: "GET /api/v1/auth/users"
    why: "Bundle 2 audit-2026-05-10 MED-11 users page."
  - route: "DELETE /api/v1/auth/users/{id}"
    why: "Bundle 2 audit-2026-05-10 MED-11 user deactivate."
  - route: "POST /api/v1/auth/users/{id}/reactivate"
    why: "Bundle 2 audit-2026-05-10 MED-11 user reactivate."
  - route: "GET /api/v1/auth/runtime-config"
    why: "Bundle 2 audit-2026-05-10 MED-12 effective auth-runtime-config (read-only)."
  - route: "POST /api/v1/auth/demo-residual/cleanup"
    why: "Audit 2026-05-11 A-8 demo-mode residual-grants cleanup endpoint."
  - route: "GET /api/v1/audit/export"
    why: "Bundle 1 Phase 8 streaming NDJSON audit export."
@@ -0,0 +1,458 @@
 // Copyright 2026 certctl LLC. All rights reserved.
 // SPDX-License-Identifier: BUSL-1.1
 package main
 import (
 	"context"
 	"encoding/json"
 	"encoding/pem"
 	"fmt"
 	"io"
 	"net/http"
 	"os"
 	"strings"
 	"github.com/certctl-io/certctl/internal/connector/target"
 	"github.com/certctl-io/certctl/internal/connector/target/apache"
 	"github.com/certctl-io/certctl/internal/connector/target/awsacm"
 	"github.com/certctl-io/certctl/internal/connector/target/azurekv"
 	"github.com/certctl-io/certctl/internal/connector/target/caddy"
 	"github.com/certctl-io/certctl/internal/connector/target/envoy"
 	"github.com/certctl-io/certctl/internal/connector/target/f5"
 	"github.com/certctl-io/certctl/internal/connector/target/haproxy"
 	"github.com/certctl-io/certctl/internal/connector/target/iis"
 	jks "github.com/certctl-io/certctl/internal/connector/target/javakeystore"
 	k8s "github.com/certctl-io/certctl/internal/connector/target/k8ssecret"
 	"github.com/certctl-io/certctl/internal/connector/target/nginx"
 	pf "github.com/certctl-io/certctl/internal/connector/target/postfix"
 	sshconn "github.com/certctl-io/certctl/internal/connector/target/ssh"
 	"github.com/certctl-io/certctl/internal/connector/target/traefik"
 	wcs "github.com/certctl-io/certctl/internal/connector/target/wincertstore"
 )
 // Phase 9 ARCH-M2 closure Sprint 12 (2026-05-14): extracted from
 // cmd/agent/main.go via the Option B sibling-file pattern.
 //
 // This file holds the DEPLOYMENT executor + the target connector
 // factory + the deploy-only helpers:
 //
 //   - executeDeploymentJob: handles Pending deployment jobs by
 //     fetching the cert PEM from the control plane, loading the
 //     locally-held private key (in agent keygen mode), instantiating
 //     the appropriate target connector via createTargetConnector,
 //     calling DeployCertificate on it, and reporting Completed or
 //     Failed back to the control plane.
 //   - createTargetConnector: the big switch over target_type that
 //     instantiates one of 15 target connectors (apache / awsacm /
 //     azurekv / caddy / envoy / f5 / haproxy / iis / javakeystore /
 //     k8ssecret / nginx / postfix / ssh / traefik / wincertstore).
 //     Context is threaded into SDK-driven connectors (AWSACM,
 //     AzureKeyVault) so credential resolution honors caller
 //     cancellation per the contextcheck linter — see CI commit
 //     502823d.
 //   - splitPEMChain: split a PEM chain into (first cert, rest).
 //   - fetchCertificate: pull the PEM chain from
 //     GET /api/v1/certificates/{certID}/version.
 //
 // All 15 target-connector imports were used ONLY by
 // createTargetConnector; moving the factory here also moved the
 // 15 connector imports out of main.go, leaving the surviving
 // cmd/agent/main.go with the minimal stdlib surface its lifecycle
 // + HTTP infrastructure needs.
 // executeDeploymentJob executes a deployment job by fetching the certificate and deploying it
 // to the target system using the connector selected by the job's target_type.
 //
 // The private key is read from the local key store (keyDir/certID.key) rather than fetched
 // from the server, and the deployment includes that locally-held key. If the key path fails
 // validation or the key cannot be read, the job is reported as Failed.
 //
 // Flow:
 // 1. Report job as Running
 // 2. Fetch the certificate PEM from the control plane
 // 3. Validate the key path and load the local private key (agent keygen mode)
 // 4. Instantiate the target connector based on target_type from the work response
 // 5. Re-validate the target config via the connector's ValidateConfig
 // 6. Acquire the per-target deploy mutex and call DeployCertificate on the connector
 // 7. Probe the live TLS endpoint via verifyAndReportDeployment when host/port can be extracted
 // 8. Report job as Completed (or Failed)
 func (a *Agent) executeDeploymentJob(ctx context.Context, job JobItem) {
 	a.logger.Info("executing deployment job",
 		"job_id", job.ID,
 		"certificate_id", job.CertificateID,
 		"target_type", job.TargetType)
 	// Report job as running
 	if err := a.reportJobStatus(ctx, job.ID, "Running", ""); err != nil {
 		a.logger.Error("failed to report job running", "error", err)
 	}
 	// Fetch the certificate from the control plane
 	certPEM, err := a.fetchCertificate(ctx, job.CertificateID)
 	if err != nil {
 		a.logger.Error("failed to fetch certificate",
 			"job_id", job.ID,
 			"error", err)
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("cert fetch failed: %v", err)); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 		}
 		return
 	}
 	a.logger.Info("certificate fetched for deployment",
 		"job_id", job.ID,
 		"cert_length", len(certPEM))
 	// Split PEM into cert and chain (separated by double newline between PEM blocks)
 	certOnly, chainPEM := splitPEMChain(certPEM)
 	// Load the locally stored private key (agent keygen mode). A key path
 	// that fails validation, or a key that cannot be read, fails the job.
 	//
 	// SEC-002 closure (Sprint 1, 2026-05-16): safeAgentKeyPath validates
 	// the certificate_id shape AND asserts the joined path is contained
 	// within a.config.KeyDir. A crafted certificate_id (path traversal,
 	// absolute path, NUL byte, Windows separators) fails closed before
 	// any disk I/O. See cmd/agent/keymem.go for the helper.
 	keyPath, kerr := safeAgentKeyPath(a.config.KeyDir, job.CertificateID)
 	if kerr != nil {
 		a.logger.Error("agent key path validation failed for deployment",
 			"job_id", job.ID,
 			"certificate_id", job.CertificateID,
 			"error", kerr)
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("key path validation failed: %v", kerr)); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "error", reportErr)
 		}
 		return
 	}
 	var keyPEM string
 	keyData, err := os.ReadFile(keyPath)
 	if err != nil {
 		a.logger.Error("failed to read local private key for deployment",
 			"job_id", job.ID,
 			"key_path", keyPath,
 			"error", err)
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("key read failed: %v", err)); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "error", reportErr)
 		}
 		return
 	}
 	keyPEM = string(keyData)
 	a.logger.Info("loaded local private key for deployment",
 		"job_id", job.ID,
 		"key_path", keyPath)
 	// Deploy to the target using the appropriate connector
 	if job.TargetType != "" {
 		connector, err := a.createTargetConnector(ctx, job.TargetType, job.TargetConfig)
 		if err != nil {
 			a.logger.Error("failed to create target connector",
 				"job_id", job.ID,
 				"target_type", job.TargetType,
 				"error", err)
 			if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("connector init failed: %v", err)); reportErr != nil {
 				a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 			}
 			return
 		}
 		// Bundle 1 / RT-C1 closure (2026-05-12): defense in depth. The server
 		// runs internal/connector/target/configcheck.Validate on the way IN
 		// (Create/Update), and rejects shell metacharacters in command-bearing
 		// fields. Re-run the connector's full ValidateConfig here on the way
 		// OUT, before any DeployCertificate call. This catches (a) configs
 		// that pre-date the server-side guard, (b) corruption/tampering of
 		// the encrypted config blob, and (c) per-connector filesystem
 		// invariants (cert dir exists, paths writable) that the server can't
 		// check because the filesystem is on the agent host.
 		if err := connector.ValidateConfig(ctx, job.TargetConfig); err != nil {
 			a.logger.Error("connector config validation failed",
 				"job_id", job.ID,
 				"target_type", job.TargetType,
 				"error", err)
 			if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("%s config validation failed: %v", job.TargetType, err)); reportErr != nil {
 				a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 			}
 			return
 		}
 		deployReq := target.DeploymentRequest{
 			CertPEM:      certOnly,
 			KeyPEM:       keyPEM,
 			ChainPEM:     chainPEM,
 			TargetConfig: job.TargetConfig,
 			Metadata: map[string]string{
 				"certificate_id": job.CertificateID,
 				"job_id":         job.ID,
 			},
 		}
 		// Phase 2 of the deploy-hardening I master bundle:
 		// per-target deploy mutex. Acquire BEFORE
 		// DeployCertificate so two concurrent renewals against
 		// the same target ID serialize. The lock is held for the
 		// full Deploy duration including PreCommit (validate),
 		// PostCommit (reload), and post-deploy verify (Phases
 		// 4-9). Released on every return path via defer.
 		var targetID string
 		if job.TargetID != nil {
 			targetID = *job.TargetID
 		}
 		if mu := a.targetDeployMutex(targetID); mu != nil {
 			mu.Lock()
 			defer mu.Unlock()
 		}
 		result, err := connector.DeployCertificate(ctx, deployReq)
 		if err != nil {
 			a.logger.Error("deployment failed",
 				"job_id", job.ID,
 				"target_type", job.TargetType,
 				"error", err)
 			if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("deployment failed: %v", err)); reportErr != nil {
 				a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 			}
 			return
 		}
 		a.logger.Info("target connector deployment completed",
 			"job_id", job.ID,
 			"target_type", job.TargetType,
 			"success", result.Success,
 			"message", result.Message)
 		// If verification is enabled, verify the deployment by probing the live TLS endpoint
 		targetHost, targetPort, err := extractTargetHostAndPort(job.TargetConfig)
 		if err != nil {
 			a.logger.Warn("could not extract target host/port for verification",
 				"job_id", job.ID,
 				"error", err)
 		} else {
 			a.verifyAndReportDeployment(ctx, job, targetHost, targetPort, certOnly)
 		}
 	} else {
 		a.logger.Info("no target type specified, skipping connector invocation",
 			"job_id", job.ID)
 	}
 	// Report job as completed
 	if err := a.reportJobStatus(ctx, job.ID, "Completed", ""); err != nil {
 		a.logger.Error("failed to report job completed", "error", err)
 		return
 	}
 	a.logger.Info("deployment job completed", "job_id", job.ID)
 }
 // createTargetConnector instantiates the appropriate target connector based on type.
 // ctx is threaded into SDK-driven connectors (AWSACM, AzureKeyVault) so credential
 // resolution honors caller cancellation / deadlines instead of using a fresh
 // context.Background() (the contextcheck linter enforces this — the original Rank 5
 // implementation used Background() and tripped CI on commit 502823d).
 func (a *Agent) createTargetConnector(ctx context.Context, targetType string, configJSON json.RawMessage) (target.Connector, error) {
 	switch targetType {
 	case "NGINX":
 		var cfg nginx.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid NGINX config: %w", err)
 			}
 		}
 		return nginx.New(&cfg, a.logger), nil
 	case "Apache":
 		var cfg apache.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid Apache config: %w", err)
 			}
 		}
 		return apache.New(&cfg, a.logger), nil
 	case "HAProxy":
 		var cfg haproxy.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid HAProxy config: %w", err)
 			}
 		}
 		return haproxy.New(&cfg, a.logger), nil
 	case "F5":
 		var cfg f5.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid F5 config: %w", err)
 			}
 		}
 		conn, err := f5.New(&cfg, a.logger)
 		if err != nil {
 			return nil, fmt.Errorf("failed to create F5 connector: %w", err)
 		}
 		return conn, nil
 	case "IIS":
 		var cfg iis.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid IIS config: %w", err)
 			}
 		}
 		return iis.New(&cfg, a.logger)
 	case "Traefik":
 		var cfg traefik.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid Traefik config: %w", err)
 			}
 		}
 		return traefik.New(&cfg, a.logger), nil
 	case "Caddy":
 		var cfg caddy.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid Caddy config: %w", err)
 			}
 		}
 		return caddy.New(&cfg, a.logger), nil
 	case "Envoy":
 		var cfg envoy.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid Envoy config: %w", err)
 			}
 		}
 		return envoy.New(&cfg, a.logger), nil
 	case "Postfix":
 		var cfg pf.Config
 		cfg.Mode = "postfix"
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid Postfix config: %w", err)
 			}
 		}
 		return pf.New(&cfg, a.logger), nil
 	case "Dovecot":
 		var cfg pf.Config
 		cfg.Mode = "dovecot"
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid Dovecot config: %w", err)
 			}
 		}
 		return pf.New(&cfg, a.logger), nil
 	case "SSH":
 		var cfg sshconn.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid SSH config: %w", err)
 			}
 		}
 		return sshconn.New(&cfg, a.logger)
 	case "WinCertStore":
 		var cfg wcs.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid WinCertStore config: %w", err)
 			}
 		}
 		return wcs.New(&cfg, a.logger)
 	case "JavaKeystore":
 		var cfg jks.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid JavaKeystore config: %w", err)
 			}
 		}
 		return jks.New(&cfg, a.logger), nil
 	case "KubernetesSecrets":
 		var cfg k8s.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid KubernetesSecrets config: %w", err)
 			}
 		}
 		return k8s.New(&cfg, a.logger)
 	case "AWSACM":
 		// Rank 5 of the 2026-05-03 Infisical deep-research deliverable.
 		// AWS Certificate Manager target — SDK-driven (no file I/O).
 		// LoadDefaultConfig handles the standard AWS credential chain
 		// (IRSA / EC2 instance profile / SSO / env vars) without any
 		// long-lived creds in connector Config.
 		var cfg awsacm.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid AWSACM config: %w", err)
 			}
 		}
 		return awsacm.New(ctx, &cfg, a.logger)
 	case "AzureKeyVault":
 		// Rank 5 of the 2026-05-03 Infisical deep-research deliverable.
 		// Azure Key Vault target — SDK-driven (no file I/O).
 		// DefaultAzureCredential handles the standard Azure credential
 		// chain (managed identity / workload identity / env vars / az
 		// CLI fallback). Long-lived service-principal secrets are
 		// supported but discouraged via the credential_mode config.
 		var cfg azurekv.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid AzureKeyVault config: %w", err)
 			}
 		}
 		return azurekv.New(ctx, &cfg, a.logger)
 	default:
 		return nil, fmt.Errorf("unsupported target type: %s", targetType)
 	}
 }
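Every case above repeats one decode step: start from a zero-value (or pre-seeded) Config and unmarshal only when configJSON is non-empty. The sketch below shows one way that step could be factored with a generic helper. `decodeConfig` and `exampleConfig` are hypothetical names introduced for illustration; they are not part of the agent.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// exampleConfig is a hypothetical stand-in for a connector Config type.
type exampleConfig struct {
	Mode     string `json:"mode"`
	CertPath string `json:"cert_path"`
}

// decodeConfig unmarshals raw into cfg only when raw is non-empty, so
// defaults pre-seeded by the caller survive an absent payload.
func decodeConfig[T any](name string, raw json.RawMessage, cfg *T) error {
	if len(raw) == 0 {
		return nil
	}
	if err := json.Unmarshal(raw, cfg); err != nil {
		return fmt.Errorf("invalid %s config: %w", name, err)
	}
	return nil
}

func main() {
	// Pre-seeded default, mirroring how the Postfix case sets Mode first.
	cfg := exampleConfig{Mode: "postfix"}
	raw := json.RawMessage(`{"cert_path":"/etc/ssl/mail.pem"}`)
	if err := decodeConfig("Postfix", raw, &cfg); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Whether such a helper is worth it depends on how often connectors are added; the explicit switch keeps each case greppable.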
 // splitPEMChain splits a PEM chain into the first certificate (cert) and the rest (chain).
 // The control plane returns the full chain as a single string with PEM blocks concatenated.
 func splitPEMChain(pemChain string) (string, string) {
 	data := []byte(pemChain)
 	block, rest := pem.Decode(data)
 	if block == nil {
 		return pemChain, ""
 	}
 	cert := string(pem.EncodeToMemory(block))
 	// Trim whitespace between cert and chain; the chain is empty when no
 	// further PEM blocks follow the first certificate.
 	return cert, strings.TrimSpace(string(rest))
 }
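The leaf/chain split can be exercised without real certificates, because pem.Decode does not inspect the payload of a block. The sketch below is a standalone illustration under that assumption; `splitFirstBlock` and `fakeCert` are hypothetical names, not the agent's functions.

```go
package main

import (
	"encoding/pem"
	"fmt"
	"strings"
)

// splitFirstBlock returns the first PEM block re-encoded, plus whatever
// follows it with surrounding whitespace trimmed.
func splitFirstBlock(bundle string) (first, rest string) {
	block, remainder := pem.Decode([]byte(bundle))
	if block == nil {
		return bundle, "" // not PEM: hand the input back unchanged
	}
	return string(pem.EncodeToMemory(block)), strings.TrimSpace(string(remainder))
}

// fakeCert wraps placeholder bytes in a CERTIFICATE block. The payload is
// not a valid certificate; it only needs to survive PEM encoding.
func fakeCert(payload string) string {
	return string(pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: []byte(payload)}))
}

func main() {
	bundle := fakeCert("leaf") + fakeCert("intermediate") + fakeCert("root")
	leaf, chain := splitFirstBlock(bundle)
	fmt.Println("leaf blocks: ", strings.Count(leaf, "BEGIN CERTIFICATE"))
	fmt.Println("chain blocks:", strings.Count(chain, "BEGIN CERTIFICATE"))
}
```

Note that the trimmed remainder loses its trailing newline, so callers that write the chain to disk may want to append one.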
 // fetchCertificate retrieves the certificate PEM chain from the control plane.
 // GET /api/v1/agents/{agentID}/certificates/{certID}
 func (a *Agent) fetchCertificate(ctx context.Context, certID string) (string, error) {
 	path := fmt.Sprintf("/api/v1/agents/%s/certificates/%s", a.config.AgentID, certID)
 	resp, err := a.makeRequest(ctx, http.MethodGet, path, nil)
 	if err != nil {
 		return "", fmt.Errorf("request failed: %w", err)
 	}
 	defer resp.Body.Close()
 	if resp.StatusCode != http.StatusOK {
 		body, _ := io.ReadAll(resp.Body)
 		return "", fmt.Errorf("server returned %d: %s", resp.StatusCode, string(body))
 	}
 	var certResp struct {
 		CertificatePEM string `json:"certificate_pem"`
 	}
 	if err := json.NewDecoder(resp.Body).Decode(&certResp); err != nil {
 		return "", fmt.Errorf("failed to decode response: %w", err)
 	}
 	return certResp.CertificatePEM, nil
 }
@@ -0,0 +1,275 @@
 // Copyright 2026 certctl LLC. All rights reserved.
 // SPDX-License-Identifier: BUSL-1.1
 package main
 import (
 	"context"
 	"crypto/ecdsa"
 	"crypto/rsa"
 	"crypto/sha256"
 	"crypto/x509"
 	"encoding/pem"
 	"fmt"
 	"io"
 	"net/http"
 	"os"
 	"path/filepath"
 	"strings"
 	"time"
 )
 // Phase 9 ARCH-M2 closure Sprint 12 (2026-05-14): extracted from
 // cmd/agent/main.go via the Option B sibling-file pattern.
 //
 // This file holds the filesystem DISCOVERY scan — the agent's
 // outbound surface for reporting pre-existing certificates it
 // finds on disk back to the control plane (POST /api/v1/agents/
 // {id}/discoveries, a machine-to-machine flow NOT exposed via the
 // MCP surface per the comment in
 // internal/mcp/tools.go::RegisterTools):
 //
 //   - runDiscoveryScan: walks each configured discovery directory,
 //     dispatches each candidate file to parsePEMFile or parseDERFile
 //     depending on extension, batches the parsed entries, and POSTs
 //     them in one report.
 //   - parsePEMFile / parseDERFile: extract every X.509 certificate
 //     from a candidate file in either encoding.
 //   - certToEntry: project a parsed *x509.Certificate into the
 //     discoveredCertEntry shape the control plane expects.
 //   - discoveredCertEntry struct + sha256Sum + certKeyInfo helpers
 //     consumed only by the discovery path; co-locating them keeps
 //     this file self-contained.
 // runDiscoveryScan walks configured directories, parses certificate files, and reports
 // discovered certificates to the control plane.
 // Supports PEM and DER encoded X.509 certificates.
 func (a *Agent) runDiscoveryScan(ctx context.Context) {
 	a.logger.Info("starting filesystem certificate discovery scan",
 		"directories", a.config.DiscoveryDirs)
 	startTime := time.Now()
 	var certs []discoveredCertEntry
 	var scanErrors []string
 	for _, dir := range a.config.DiscoveryDirs {
 		a.logger.Debug("scanning directory", "path", dir)
 		err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
 			if err != nil {
 				scanErrors = append(scanErrors, fmt.Sprintf("walk error at %s: %v", path, err))
 				return nil // continue walking
 			}
 			if info.IsDir() {
 				return nil
 			}
 			// Skip files larger than 1MB (unlikely to be a certificate)
 			if info.Size() > 1*1024*1024 {
 				return nil
 			}
 			// Check file extension
 			ext := strings.ToLower(filepath.Ext(path))
 			switch ext {
 			case ".pem", ".crt", ".cer", ".cert":
 				found := a.parsePEMFile(path)
 				certs = append(certs, found...)
 			case ".der":
 				if entry, err := a.parseDERFile(path); err == nil {
 					certs = append(certs, entry)
 				} else {
 					a.logger.Debug("skipping non-cert DER file", "path", path, "error", err)
 				}
 			default:
 				// Skip key files and extensionless files; for any other
 				// unknown extension, try PEM parsing and keep whatever parses.
 				if ext == "" || ext == ".key" {
 					return nil
 				}
 				found := a.parsePEMFile(path)
 				if len(found) > 0 {
 					certs = append(certs, found...)
 				}
 			}
 			return nil
 		})
 		if err != nil {
 			scanErrors = append(scanErrors, fmt.Sprintf("failed to walk %s: %v", dir, err))
 		}
 	}
 	scanDuration := time.Since(startTime)
 	a.logger.Info("discovery scan completed",
 		"certificates_found", len(certs),
 		"errors", len(scanErrors),
 		"duration_ms", scanDuration.Milliseconds())
 	if len(certs) == 0 && len(scanErrors) == 0 {
 		a.logger.Debug("no certificates found and no errors, skipping report")
 		return
 	}
 	// Build report payload
 	entries := make([]map[string]interface{}, len(certs))
 	for i, c := range certs {
 		entries[i] = map[string]interface{}{
 			"fingerprint_sha256": c.FingerprintSHA256,
 			"common_name":        c.CommonName,
 			"sans":               c.SANs,
 			"serial_number":      c.SerialNumber,
 			"issuer_dn":          c.IssuerDN,
 			"subject_dn":         c.SubjectDN,
 			"not_before":         c.NotBefore,
 			"not_after":          c.NotAfter,
 			"key_algorithm":      c.KeyAlgorithm,
 			"key_size":           c.KeySize,
 			"is_ca":              c.IsCA,
 			"pem_data":           c.PEMData,
 			"source_path":        c.SourcePath,
 			"source_format":      c.SourceFormat,
 		}
 	}
 	report := map[string]interface{}{
 		"agent_id":         a.config.AgentID,
 		"directories":      a.config.DiscoveryDirs,
 		"certificates":     entries,
 		"errors":           scanErrors,
 		"scan_duration_ms": int(scanDuration.Milliseconds()),
 	}
 	// Submit to control plane
 	path := fmt.Sprintf("/api/v1/agents/%s/discoveries", a.config.AgentID)
 	resp, err := a.makeRequest(ctx, http.MethodPost, path, report)
 	if err != nil {
 		a.logger.Error("failed to submit discovery report", "error", err)
 		return
 	}
 	defer resp.Body.Close()
 	if resp.StatusCode != http.StatusAccepted {
 		body, _ := io.ReadAll(resp.Body)
 		a.logger.Error("discovery report rejected",
 			"status", resp.StatusCode,
 			"body", string(body))
 		return
 	}
 	a.logger.Info("discovery report submitted successfully",
 		"certificates", len(certs),
 		"errors", len(scanErrors))
 }
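The per-file dispatch inside the walk callback reduces to a small decision table on size and extension. The sketch below restates it as a pure function so the rules can be checked in isolation; `classifyCandidate`, `maxCandidateSize`, and the returned labels are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// maxCandidateSize mirrors the 1 MiB cutoff used by the scan.
const maxCandidateSize = 1 << 20

// classifyCandidate reports how a file would be handled: "pem", "der",
// "pem-probe" (unknown extension, try PEM and keep any hits), or "skip".
func classifyCandidate(path string, size int64) string {
	if size > maxCandidateSize {
		return "skip"
	}
	switch strings.ToLower(filepath.Ext(path)) {
	case ".pem", ".crt", ".cer", ".cert":
		return "pem"
	case ".der":
		return "der"
	case "", ".key":
		return "skip"
	default:
		return "pem-probe"
	}
}

func main() {
	for _, p := range []string{"site.PEM", "legacy.der", "server.key", "README", "bundle.txt"} {
		fmt.Printf("%-12s %s\n", p, classifyCandidate(p, 2048))
	}
}
```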
 // discoveredCertEntry holds parsed certificate metadata for reporting.
 type discoveredCertEntry struct {
 	FingerprintSHA256 string   `json:"fingerprint_sha256"`
 	CommonName        string   `json:"common_name"`
 	SANs              []string `json:"sans"`
 	SerialNumber      string   `json:"serial_number"`
 	IssuerDN          string   `json:"issuer_dn"`
 	SubjectDN         string   `json:"subject_dn"`
 	NotBefore         string   `json:"not_before"`
 	NotAfter          string   `json:"not_after"`
 	KeyAlgorithm      string   `json:"key_algorithm"`
 	KeySize           int      `json:"key_size"`
 	IsCA              bool     `json:"is_ca"`
 	PEMData           string   `json:"pem_data"`
 	SourcePath        string   `json:"source_path"`
 	SourceFormat      string   `json:"source_format"`
 }
 // parsePEMFile reads a file and extracts all X.509 certificates from PEM blocks.
 func (a *Agent) parsePEMFile(path string) []discoveredCertEntry {
 	data, err := os.ReadFile(path)
 	if err != nil {
 		a.logger.Debug("failed to read file", "path", path, "error", err)
 		return nil
 	}
 	var entries []discoveredCertEntry
 	rest := data
 	for {
 		var block *pem.Block
 		block, rest = pem.Decode(rest)
 		if block == nil {
 			break
 		}
 		if block.Type != "CERTIFICATE" {
 			continue
 		}
 		cert, err := x509.ParseCertificate(block.Bytes)
 		if err != nil {
 			a.logger.Debug("failed to parse certificate in PEM", "path", path, "error", err)
 			continue
 		}
 		pemStr := string(pem.EncodeToMemory(block))
 		entries = append(entries, certToEntry(cert, path, "PEM", pemStr))
 	}
 	return entries
 }
 // parseDERFile reads a DER-encoded certificate file.
 func (a *Agent) parseDERFile(path string) (discoveredCertEntry, error) {
 	data, err := os.ReadFile(path)
 	if err != nil {
 		return discoveredCertEntry{}, fmt.Errorf("read failed: %w", err)
 	}
 	cert, err := x509.ParseCertificate(data)
 	if err != nil {
 		return discoveredCertEntry{}, fmt.Errorf("parse failed: %w", err)
 	}
 	// Convert to PEM for storage
 	pemStr := string(pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: data}))
 	return certToEntry(cert, path, "DER", pemStr), nil
 }
 // certToEntry converts a parsed x509.Certificate into a discoveredCertEntry.
 func certToEntry(cert *x509.Certificate, path, format, pemData string) discoveredCertEntry {
 	// Compute SHA-256 fingerprint
 	fingerprint := fmt.Sprintf("%x", sha256Sum(cert.Raw))
 	// Determine key algorithm and size
 	keyAlg, keySize := certKeyInfo(cert)
 	return discoveredCertEntry{
 		FingerprintSHA256: fingerprint,
 		CommonName:        cert.Subject.CommonName,
 		// DNS SANs only: IP, email, and URI SANs are not reported.
 		SANs:              cert.DNSNames,
 		SerialNumber:      cert.SerialNumber.Text(16),
 		IssuerDN:          cert.Issuer.String(),
 		SubjectDN:         cert.Subject.String(),
 		NotBefore:         cert.NotBefore.UTC().Format(time.RFC3339),
 		NotAfter:          cert.NotAfter.UTC().Format(time.RFC3339),
 		KeyAlgorithm:      keyAlg,
 		KeySize:           keySize,
 		IsCA:              cert.IsCA,
 		PEMData:           pemData,
 		SourcePath:        path,
 		SourceFormat:      format,
 	}
 }
 // sha256Sum returns the SHA-256 hash of data.
 func sha256Sum(data []byte) [32]byte {
 	return sha256.Sum256(data)
 }
 // certKeyInfo extracts key algorithm name and size from a certificate.
 func certKeyInfo(cert *x509.Certificate) (string, int) {
 	switch pub := cert.PublicKey.(type) {
 	case *ecdsa.PublicKey:
 		return "ECDSA", pub.Curve.Params().BitSize
 	case *rsa.PublicKey:
 		return "RSA", pub.N.BitLen()
 	default:
 		switch cert.PublicKeyAlgorithm {
 		case x509.Ed25519:
 			return "Ed25519", 256
 		default:
 			return cert.PublicKeyAlgorithm.String(), 0
 		}
 	}
 }
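The fingerprint and key-info helpers can be tried against a throwaway certificate built in memory. The sketch below assumes a self-signed ECDSA P-256 certificate; `selfSigned`, `fingerprint`, and `keyInfo` are hypothetical names, and `keyInfo` covers only the ECDSA branch.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/hex"
	"fmt"
	"math/big"
	"time"
)

// selfSigned returns a throwaway self-signed ECDSA P-256 certificate.
func selfSigned(cn string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: cn},
		DNSNames:     []string{cn},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// fingerprint returns the lowercase hex SHA-256 digest of the DER bytes.
func fingerprint(cert *x509.Certificate) string {
	sum := sha256.Sum256(cert.Raw)
	return hex.EncodeToString(sum[:])
}

// keyInfo reports algorithm and size for the ECDSA case; anything else
// falls back to the algorithm name with size 0.
func keyInfo(cert *x509.Certificate) (string, int) {
	if pub, ok := cert.PublicKey.(*ecdsa.PublicKey); ok {
		return "ECDSA", pub.Curve.Params().BitSize
	}
	return cert.PublicKeyAlgorithm.String(), 0
}

func main() {
	cert, err := selfSigned("example.test")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	alg, size := keyInfo(cert)
	fmt.Println(fingerprint(cert), alg, size)
}
```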
@@ -9,6 +9,8 @@ import (
 	"fmt"
 	"os"
 	"path/filepath"
 	"regexp"
 	"strings"
 )
 // Bundle-9 / Audit L-002 + L-003 (agent edition).
@@ -41,6 +43,87 @@ func marshalAgentKeyAndZeroize(priv *ecdsa.PrivateKey, onDER func([]byte) error)
 	return onDER(der)
 }
 // SEC-002 closure (Sprint 1, 2026-05-16). The agent derives an on-disk
 // key path from job.CertificateID via filepath.Join. Pre-fix, a
 // crafted certificate_id ("../../etc/passwd", "/absolute/path",
 // "abc\x00d", "..\\Windows\\path") would drive arbitrary file
 // write/read on the agent host. The shape regex below mirrors the
 // server-side internal/validation.ValidateCertificateID gate — both
 // ends MUST hold for the load-bearing defense (the server can't be
 // trusted in isolation; a compromised control plane could deliver a
 // crafted job).
 //
 // agentCertIDPattern accepts ASCII letters, digits, ".", "_", "-",
 // bounded to 128 chars. Existing prefixed IDs (mc-..., cert-..., etc.)
 // satisfy this trivially. It rejects path separators (POSIX and
 // Windows), NUL byte, whitespace, and control characters. The bare
 // relative-path tokens "." and ".." do match the pattern, so
 // validateAgentCertID rejects them with an explicit check.
 var agentCertIDPattern = regexp.MustCompile(`^[A-Za-z0-9._-]{1,128}$`)
 // validateAgentCertID returns an error if id is not a well-formed
 // certificate identifier. Mirrors internal/validation.ValidateCertificateID
 // — the duplication is deliberate per the package-level comment
 // ("cmd/agent is a separate binary; copy-paste cheaper than lifting
 // a shared internal/keystore for a single shape check").
 func validateAgentCertID(id string) error {
 	if id == "" {
 		return fmt.Errorf("certificate_id is required")
 	}
 	if len(id) > 128 {
 		return fmt.Errorf("certificate_id length %d exceeds 128", len(id))
 	}
 	if !agentCertIDPattern.MatchString(id) {
 		return fmt.Errorf("certificate_id %q contains disallowed characters", id)
 	}
 	if id == "." || id == ".." {
 		return fmt.Errorf("certificate_id %q is a relative-path token", id)
 	}
 	return nil
 }
 // safeAgentKeyPath returns the on-disk key path for the given
 // certificateID, after validating the ID shape AND asserting the
 // joined path is contained within keyDir. Containment is the
 // authoritative guard — even if validateAgentCertID is bypassed (e.g.
 // a future refactor removes it), the post-Clean rel-path check below
 // rejects any path that escapes keyDir.
 //
 // The two-leg defense:
 //
 //	leg 1: shape check (validateAgentCertID)  → cheap up-front fail
 //	leg 2: containment check (filepath.Rel)   → load-bearing guard
 //
 // Returns the joined path on success, or a non-nil error describing
 // the rejected vector.
 func safeAgentKeyPath(keyDir, certificateID string) (string, error) {
 	if err := validateAgentCertID(certificateID); err != nil {
 		return "", err
 	}
 	if keyDir == "" {
 		return "", fmt.Errorf("safeAgentKeyPath: empty keyDir")
 	}
 	cleanDir, err := filepath.Abs(filepath.Clean(keyDir))
 	if err != nil {
 		return "", fmt.Errorf("safeAgentKeyPath: resolve keyDir: %w", err)
 	}
 	joined := filepath.Join(cleanDir, certificateID+".key")
 	cleanJoined := filepath.Clean(joined)
 	rel, err := filepath.Rel(cleanDir, cleanJoined)
 	if err != nil {
 		return "", fmt.Errorf("safeAgentKeyPath: rel(%q,%q): %w", cleanDir, cleanJoined, err)
 	}
 	// Reject any path that escapes the directory: a leading ".." in the
 	// relative form means the joined path resolved outside keyDir.
 	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
 		return "", fmt.Errorf("safeAgentKeyPath: %q escapes keyDir %q (rel=%q)", certificateID, cleanDir, rel)
 	}
 	// Belt-and-suspenders: the rel form must also not contain a NUL.
 	if strings.ContainsRune(rel, 0) {
 		return "", fmt.Errorf("safeAgentKeyPath: NUL byte in computed path")
 	}
 	return cleanJoined, nil
 }
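The containment leg is worth seeing on its own, because a naive string-prefix comparison would accept a sibling directory such as /var/keys-evil. The sketch below is a simplified, lexical-only illustration using POSIX-style paths; `containedIn` is a hypothetical helper and, unlike safeAgentKeyPath, it also rejects the directory itself.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// containedIn reports whether candidate resolves to a location strictly
// inside dir after lexical cleaning. It does not resolve symlinks.
func containedIn(dir, candidate string) bool {
	rel, err := filepath.Rel(filepath.Clean(dir), filepath.Clean(candidate))
	if err != nil {
		return false
	}
	if rel == "." || rel == ".." {
		return false
	}
	return !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

func main() {
	dir := "/var/keys"
	for _, c := range []string{
		"/var/keys/mc-good.key",
		"/var/keys/../../etc/passwd",
		"/var/keys-evil/mc-good.key",
	} {
		fmt.Printf("%-32s contained=%v\n", c, containedIn(dir, c))
	}
}
```

Because the check is purely lexical, a symlink inside the directory can still point elsewhere; closing that gap would need something like filepath.EvalSymlinks, which is outside what this sketch covers.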
 // ensureAgentKeyDirSecure creates dir (and ancestors) with mode 0700 or
 // asserts an existing dir is owner-only. If a pre-existing dir is more
 // permissive than 0700 we tighten it to 0700 (logging-free; this is a
@@ -716,3 +716,113 @@ func TestKeymem_AgentMainFlowSmoke(t *testing.T) {
 		}
 	}
 }
 // =============================================================================
 // SEC-002 closure (Sprint 1, 2026-05-16) — safeAgentKeyPath path-traversal
 // regression coverage.
 //
 // Pre-fix the agent built the on-disk key path via:
 //
 //	keyPath := filepath.Join(a.config.KeyDir, job.CertificateID+".key")
 //
 // migrations/000001_initial_schema.up.sql declares
 // managed_certificates.id as TEXT PRIMARY KEY with no shape constraint, so
 // a crafted certificate_id from a compromised control plane (or a poisoned
 // DB row) could land outside KeyDir. The fix:
 //
 //   - validateAgentCertID rejects shape violations up-front
 //   - safeAgentKeyPath additionally asserts the joined path is contained
 //     within KeyDir via filepath.Rel; even a future refactor that drops
 //     the shape regex would still fail closed on escape.
 //
 // These tests pin both legs against the four vectors called out in the
 // audit (../../etc/passwd, /absolute/path, NUL byte, Windows separators).
 // =============================================================================
 func TestValidateAgentCertID_AcceptsCanonicalShapes(t *testing.T) {
 	for _, id := range []string{
 		"mc-cdn-edge",
 		"mc-cdn-edge-2026.q1",
 		"cert-1",
 		"abc123",
 		"MC-UPPER",
 	} {
 		t.Run(id, func(t *testing.T) {
 			if err := validateAgentCertID(id); err != nil {
 				t.Errorf("validateAgentCertID(%q): unexpected error %v", id, err)
 			}
 		})
 	}
 }
 func TestValidateAgentCertID_RejectsTraversalVectors(t *testing.T) {
 	cases := []struct {
 		name string
 		id   string
 	}{
 		{"empty", ""},
 		{"parent_token", ".."},
 		{"current_token", "."},
 		{"posix_traversal", "../../etc/passwd"},
 		{"absolute_posix", "/absolute/path"},
 		{"windows_traversal", `..\..\evil`},
 		{"windows_separator", `bad\path`},
 		{"nul_byte", "abc\x00def"},
 		{"newline", "abc\ndef"},
 		{"space", "id with spaces"},
 		{"overlong", strings.Repeat("a", 129)},
 	}
 	for _, tc := range cases {
 		t.Run(tc.name, func(t *testing.T) {
 			if err := validateAgentCertID(tc.id); err == nil {
 				t.Errorf("id=%q: expected rejection, got nil", tc.id)
 			}
 		})
 	}
 }
 func TestSafeAgentKeyPath_HappyPath_ProducesContainedPath(t *testing.T) {
 	keyDir := t.TempDir()
 	got, err := safeAgentKeyPath(keyDir, "mc-good")
 	if err != nil {
 		t.Fatalf("safeAgentKeyPath: %v", err)
 	}
 	want := filepath.Join(keyDir, "mc-good.key")
 	// filepath.Clean normalisation may strip a trailing separator, etc.;
 	// compare canonical forms.
 	if filepath.Clean(got) != filepath.Clean(want) {
 		t.Errorf("safeAgentKeyPath = %q; want %q", got, want)
 	}
 }
 func TestSafeAgentKeyPath_RejectsTraversalVectors(t *testing.T) {
 	keyDir := t.TempDir()
 	cases := []struct {
 		name string
 		id   string
 	}{
 		{"posix_traversal", "../../etc/passwd"},
 		{"absolute_posix", "/etc/passwd"},
 		{"parent_token", ".."},
 		{"current_token", "."},
 		{"windows_traversal", `..\..\evil`},
 		{"windows_separator", `bad\path`},
 		{"nul_byte", "abc\x00def"},
 		{"empty", ""},
 	}
 	for _, tc := range cases {
 		t.Run(tc.name, func(t *testing.T) {
 			_, err := safeAgentKeyPath(keyDir, tc.id)
 			if err == nil {
 				t.Errorf("id=%q: expected rejection, got nil", tc.id)
 			}
 		})
 	}
 }
 func TestSafeAgentKeyPath_RejectsEmptyKeyDir(t *testing.T) {
 	_, err := safeAgentKeyPath("", "mc-good")
 	if err == nil {
 		t.Errorf("empty keyDir: expected rejection, got nil")
 	}
 }
@@ -6,49 +6,27 @@ package main
 import (
 	"bytes"
 	"context"
 	"crypto/ecdsa"
 	"crypto/elliptic"
 	"crypto/rand"
 	"crypto/rsa"
 	"crypto/sha256"
 	"crypto/tls"
 	"crypto/x509"
 	"crypto/x509/pkix"
 	"encoding/json"
 	"encoding/pem"
 	"errors"
 	"flag"
 	"fmt"
 	"io"
 	"log/slog"
 	"math/rand/v2"
 	"net"
 	"net/http"
 	"net/url"
 	"os"
 	"os/signal"
 	"path/filepath"
 	"runtime"
 	"strings"
 	"sync"
 	"syscall"
 	"time"
-	"github.com/certctl-io/certctl/internal/connector/target"
+	"github.com/certctl-io/certctl/internal/scheduler"
 	"github.com/certctl-io/certctl/internal/connector/target/apache"
 	"github.com/certctl-io/certctl/internal/connector/target/awsacm"
 	"github.com/certctl-io/certctl/internal/connector/target/azurekv"
 	"github.com/certctl-io/certctl/internal/connector/target/caddy"
 	"github.com/certctl-io/certctl/internal/connector/target/envoy"
 	"github.com/certctl-io/certctl/internal/connector/target/f5"
 	"github.com/certctl-io/certctl/internal/connector/target/haproxy"
 	"github.com/certctl-io/certctl/internal/connector/target/iis"
 	jks "github.com/certctl-io/certctl/internal/connector/target/javakeystore"
 	k8s "github.com/certctl-io/certctl/internal/connector/target/k8ssecret"
 	"github.com/certctl-io/certctl/internal/connector/target/nginx"
 	pf "github.com/certctl-io/certctl/internal/connector/target/postfix"
 	sshconn "github.com/certctl-io/certctl/internal/connector/target/ssh"
 	"github.com/certctl-io/certctl/internal/connector/target/traefik"
 	wcs "github.com/certctl-io/certctl/internal/connector/target/wincertstore"
 )
 // AgentConfig represents the agent-side configuration.
@@ -256,15 +234,49 @@ func (a *Agent) Run(ctx context.Context) error {
 		a.logger.Warn("failed to enforce key directory permissions", "path", a.config.KeyDir, "error", err)
 	}
-	// Create ticker channels for heartbeat, polling, and discovery
-	heartbeatTicker := time.NewTicker(a.heartbeatInterval)
+	// SCALE-006 closure (Sprint 2, 2026-05-16). Pre-fix the agent
+	// started its heartbeat + poll loops on fixed time.NewTicker
 	// cadence with an unjittered immediate first invocation. Mass
 	// restarts (rolling K8s deploy, control-plane reboot, scheduled
 	// fleet bounce) produced a thundering herd — 5K agents booting
 	// in a 10-second window all hit /heartbeat in lockstep, then
 	// /poll, every interval forever afterward.
 	//
 	// Fix: (1) sleep a random startup-jitter ∈ [0, interval) before
 	// the first heartbeat + first poll to spread the initial cohort,
 	// and (2) use scheduler.JitteredTicker (±10% per-tick envelope)
 	// for the recurring ticks so the cohort stays spread across
 	// every tick boundary. Both legs use the existing in-tree
 	// JitteredTicker primitive (internal/scheduler/jitter.go) —
 	// pattern already exercised by every scheduler.go loop on the
 	// server side.
 	heartbeatTicker := scheduler.NewJitteredTicker(a.heartbeatInterval, scheduler.DefaultSchedulerJitter)
 	defer heartbeatTicker.Stop()
-
-	pollTicker := time.NewTicker(a.pollInterval)
+	pollTicker := scheduler.NewJitteredTicker(a.pollInterval, scheduler.DefaultSchedulerJitter)
 	defer pollTicker.Stop()
-	// Run initial heartbeat and poll
+	// Startup jitter — run-first delay drawn fresh per-agent so a
 	// 5K-agent rolling-restart spreads out across (max interval).
 	// Bounded by ctx so a sigint-during-startup exits cleanly rather
 	// than hanging on the Sleep. Heartbeat and poll are drawn
 	// independently so a single random seed doesn't create a
 	// secondary correlation pattern.
 	hbJitter := time.Duration(rand.Int64N(int64(a.heartbeatInterval)))
 	pollJitter := time.Duration(rand.Int64N(int64(a.pollInterval)))
 	a.logger.Info("startup jitter applied",
 		"heartbeat_jitter", hbJitter.String(),
 		"poll_jitter", pollJitter.String())
 	select {
 	case <-ctx.Done():
 		return ctx.Err()
 	case <-time.After(hbJitter):
 	}
 	a.sendHeartbeat(ctx)
 	select {
 	case <-ctx.Done():
 		return ctx.Err()
 	case <-time.After(pollJitter):
 	}
 	a.pollForWork(ctx)
 	// Discovery: run initial scan if directories configured, then on interval
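The two jitter legs described in this hunk (a uniform startup delay in [0, interval) and a bounded per-tick envelope) can be sketched with math/rand/v2 alone. The code below is an illustration under assumptions: `startupDelay`, `jitteredInterval`, and `sleepCtx` are hypothetical names, and the per-tick formula is not taken from internal/scheduler/jitter.go, whose implementation is not shown here.

```go
package main

import (
	"context"
	"fmt"
	"math/rand/v2"
	"time"
)

// startupDelay draws a uniform delay in [0, interval). A non-positive
// interval yields zero, because rand.Int64N panics when n <= 0.
func startupDelay(interval time.Duration) time.Duration {
	if interval <= 0 {
		return 0
	}
	return time.Duration(rand.Int64N(int64(interval)))
}

// jitteredInterval scales base by a uniform factor in
// [1-fraction, 1+fraction). Illustrative formula only.
func jitteredInterval(base time.Duration, fraction float64) time.Duration {
	factor := 1 + fraction*(2*rand.Float64()-1)
	return time.Duration(float64(base) * factor)
}

// sleepCtx waits for d or until ctx is cancelled, whichever comes first.
func sleepCtx(ctx context.Context, d time.Duration) error {
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-timer.C:
		return nil
	}
}

func main() {
	interval := 50 * time.Millisecond
	delay := startupDelay(interval)
	if err := sleepCtx(context.Background(), delay); err != nil {
		fmt.Println("cancelled:", err)
		return
	}
	fmt.Println("first tick after", delay, "next in", jitteredInterval(interval, 0.10))
}
```

The guard in `startupDelay` matters if an interval can ever be configured as zero; whether the agent validates that elsewhere is not visible in this hunk.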
@@ -394,618 +406,6 @@ func (a *Agent) sendHeartbeat(ctx context.Context) {
 	a.logger.Debug("heartbeat acknowledged")
 }
 // pollForWork queries the control plane for actionable jobs and processes them.
 // Jobs may be deployment jobs (Pending) or CSR jobs (AwaitingCSR).
 // GET /api/v1/agents/{agentID}/work
 func (a *Agent) pollForWork(ctx context.Context) {
 	a.logger.Debug("polling for work", "agent_id", a.config.AgentID)
 	path := fmt.Sprintf("/api/v1/agents/%s/work", a.config.AgentID)
 	resp, err := a.makeRequest(ctx, http.MethodGet, path, nil)
 	if err != nil {
 		a.logger.Error("work poll failed", "error", err)
 		a.consecutiveFailures++
 		return
 	}
 	defer resp.Body.Close()
 	// I-004: same terminal-retirement handling as sendHeartbeat. Work-poll is the
 	// other hot path that can observe an agent's soft-retirement; if the
 	// heartbeat tick happens to fire after a work-poll tick within the same
 	// retirement window, this branch catches it first. markRetired's sync.Once
 	// guards idempotency so racing both paths in the same tick only closes the
 	// signal channel once. No consecutiveFailures increment — retirement is
 	// not a transient failure.
 	if resp.StatusCode == http.StatusGone {
 		body, _ := io.ReadAll(resp.Body)
 		a.markRetired("work_poll", resp.StatusCode, string(body))
 		return
 	}
 	if resp.StatusCode != http.StatusOK {
 		body, _ := io.ReadAll(resp.Body)
 		a.logger.Error("work poll rejected",
 			"status", resp.StatusCode,
 			"body", string(body))
 		a.consecutiveFailures++
 		return
 	}
 	var workResp WorkResponse
 	if err := json.NewDecoder(resp.Body).Decode(&workResp); err != nil {
 		a.logger.Error("failed to decode work response", "error", err)
 		a.consecutiveFailures++
 		return
 	}
 	a.consecutiveFailures = 0
 	if workResp.Count == 0 {
 		a.logger.Debug("no pending work")
 		return
 	}
 	a.logger.Info("received work", "job_count", workResp.Count)
 	// Process each job based on type and status
 	for _, job := range workResp.Jobs {
 		switch {
 		case job.Status == "AwaitingCSR":
 			// Agent keygen mode: generate key locally, create CSR, submit to server
 			a.executeCSRJob(ctx, job)
 		case job.Type == "Deployment":
 			a.executeDeploymentJob(ctx, job)
 		}
 	}
 }
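The I-004 comment relies on sync.Once so that heartbeat and work-poll can both report retirement without closing the signal channel twice. The sketch below shows that pattern in isolation; the `retirement` type and its methods are hypothetical and are not the agent's markRetired implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// retirement is a hypothetical one-shot signal: the first caller of mark
// wins, later callers are no-ops.
type retirement struct {
	once   sync.Once
	done   chan struct{}
	source string
}

func newRetirement() *retirement {
	return &retirement{done: make(chan struct{})}
}

// mark records the first caller's source and closes done exactly once.
// Closing an already-closed channel would panic, which is what Once prevents.
func (r *retirement) mark(source string) {
	r.once.Do(func() {
		r.source = source
		close(r.done)
	})
}

// retired reports whether mark has been called.
func (r *retirement) retired() bool {
	select {
	case <-r.done:
		return true
	default:
		return false
	}
}

func main() {
	r := newRetirement()
	r.mark("work_poll")
	r.mark("heartbeat") // second call is ignored
	fmt.Println(r.retired(), r.source)
}
```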
 // executeCSRJob handles an AwaitingCSR job: generates a private key locally, creates a CSR,
 // and submits it to the control plane for signing. The private key is stored on the local
 // filesystem with 0600 permissions and NEVER sent to the server.
 //
 // Flow:
 // 1. Generate ECDSA P-256 key pair
 // 2. Store private key to disk (keyDir/certID.key) with 0600 permissions
 // 3. Create CSR with common name and SANs from work response
 // 4. Submit CSR to control plane via POST /agents/{id}/csr
 // 5. Server signs the CSR and creates a cert version + deployment jobs
 func (a *Agent) executeCSRJob(ctx context.Context, job JobItem) {
 	a.logger.Info("executing CSR job (agent-side key generation)",
 		"job_id", job.ID,
 		"certificate_id", job.CertificateID,
 		"common_name", job.CommonName)
 	// Step 1: Generate ECDSA P-256 key pair
 	privKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
 	if err != nil {
 		a.logger.Error("failed to generate private key",
 			"job_id", job.ID,
 			"error", err)
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("key generation failed: %v", err)); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 		}
 		return
 	}
 	a.logger.Info("generated ECDSA P-256 key pair locally",
 		"job_id", job.ID,
 		"certificate_id", job.CertificateID)
 	// Step 2: Store private key to disk with secure permissions.
 	//
 	// Bundle-9 / Audit L-002 + L-003: marshal+write through helpers that
 	// (a) zeroize the in-heap DER buffer immediately after the PEM block is
 	// constructed so the private scalar's exposure window is bounded by
 	// this function call, and (b) assert the key directory is mode 0700
 	// before any write touches disk. Also defer-clear the PEM buffer for
 	// the same reason — the encoded key isn't sensitive in transit (it's
 	// going to disk) but lingers on the heap if we don't.
 	keyPath := filepath.Join(a.config.KeyDir, job.CertificateID+".key")
 	if err := ensureAgentKeyDirSecure(filepath.Dir(keyPath)); err != nil {
 		a.logger.Error("agent key dir hardening failed", "job_id", job.ID, "error", err)
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("key dir hardening failed: %v", err)); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 		}
 		return
 	}
 	var privKeyPEM []byte
 	if marshalErr := marshalAgentKeyAndZeroize(privKey, func(der []byte) error {
 		privKeyPEM = pem.EncodeToMemory(&pem.Block{
 			Type:  "EC PRIVATE KEY",
 			Bytes: der,
 		})
 		return nil
 	}); marshalErr != nil {
 		a.logger.Error("failed to marshal private key",
 			"job_id", job.ID,
 			"error", marshalErr)
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("key marshal failed: %v", marshalErr)); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 		}
 		return
 	}
 	defer clear(privKeyPEM)
 	if err := os.WriteFile(keyPath, privKeyPEM, 0600); err != nil {
 		a.logger.Error("failed to write private key to disk",
 			"job_id", job.ID,
 			"key_path", keyPath,
 			"error", err)
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("key storage failed: %v", err)); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 		}
 		return
 	}
 	a.logger.Info("private key stored securely",
 		"job_id", job.ID,
 		"key_path", keyPath,
 		"permissions", "0600")
 	// Validate common name is present
 	if job.CommonName == "" {
 		a.logger.Error("empty common name in CSR job", "job_id", job.ID)
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", "empty common name"); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "error", reportErr)
 		}
 		return
 	}
 	// Step 3: Create CSR with common name and SANs
 	// Split SANs into DNS names and email addresses for proper CSR encoding
 	var dnsNames []string
 	var emailAddresses []string
 	for _, san := range job.SANs {
 		if strings.Contains(san, "@") {
 			emailAddresses = append(emailAddresses, san)
 		} else {
 			dnsNames = append(dnsNames, san)
 		}
 	}
 	csrTemplate := &x509.CertificateRequest{
 		Subject: pkix.Name{
 			CommonName: job.CommonName,
 		},
 		DNSNames:       dnsNames,
 		EmailAddresses: emailAddresses,
 	}
 	csrDER, err := x509.CreateCertificateRequest(rand.Reader, csrTemplate, privKey)
 	if err != nil {
 		a.logger.Error("failed to create CSR",
 			"job_id", job.ID,
 			"error", err)
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("CSR creation failed: %v", err)); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 		}
 		return
 	}
 	csrPEM := string(pem.EncodeToMemory(&pem.Block{
 		Type:  "CERTIFICATE REQUEST",
 		Bytes: csrDER,
 	}))
 	// Step 4: Submit CSR to the control plane (only the public key leaves the agent)
 	a.logger.Info("submitting CSR to control plane",
 		"job_id", job.ID,
 		"certificate_id", job.CertificateID)
 	submitPath := fmt.Sprintf("/api/v1/agents/%s/csr", a.config.AgentID)
 	resp, err := a.makeRequest(ctx, http.MethodPost, submitPath, map[string]string{
 		"csr_pem":        csrPEM,
 		"certificate_id": job.CertificateID,
 	})
 	if err != nil {
 		a.logger.Error("failed to submit CSR",
 			"job_id", job.ID,
 			"error", err)
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("CSR submission failed: %v", err)); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 		}
 		return
 	}
 	defer resp.Body.Close()
 	if resp.StatusCode != http.StatusAccepted {
 		body, _ := io.ReadAll(resp.Body)
 		a.logger.Error("CSR submission rejected",
 			"job_id", job.ID,
 			"status", resp.StatusCode,
 			"body", string(body))
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("CSR rejected: %s", string(body))); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 		}
 		return
 	}
 	a.logger.Info("CSR submitted and signed successfully",
 		"job_id", job.ID,
 		"certificate_id", job.CertificateID,
 		"key_path", keyPath)
 }
 // executeDeploymentJob executes a deployment job by fetching the certificate and deploying it
 // to the target system using the connector that matches the job's target type
 // (see createTargetConnector for the supported types).
 //
 // For agent keygen mode, the private key is read from the local key store (keyDir/certID.key)
 // rather than fetched from the server. The deployment includes the locally-held key.
 //
 // Flow:
 // 1. Report job as Running
 // 2. Fetch the certificate PEM from the control plane
 // 3. Load the local private key (agent keygen mode); the job fails if the key cannot be read
 // 4. Instantiate the target connector based on target_type from the work response
 // 5. Call DeployCertificate on the connector
 // 6. Report job as Completed (or Failed)
 func (a *Agent) executeDeploymentJob(ctx context.Context, job JobItem) {
 	a.logger.Info("executing deployment job",
 		"job_id", job.ID,
 		"certificate_id", job.CertificateID,
 		"target_type", job.TargetType)
 	// Report job as running
 	if err := a.reportJobStatus(ctx, job.ID, "Running", ""); err != nil {
 		a.logger.Error("failed to report job running", "job_id", job.ID, "error", err)
 	}
 	// Fetch the certificate from the control plane
 	certPEM, err := a.fetchCertificate(ctx, job.CertificateID)
 	if err != nil {
 		a.logger.Error("failed to fetch certificate",
 			"job_id", job.ID,
 			"error", err)
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("cert fetch failed: %v", err)); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 		}
 		return
 	}
 	a.logger.Info("certificate fetched for deployment",
 		"job_id", job.ID,
 		"cert_length", len(certPEM))
 	// Split the PEM bundle into the leaf certificate (first PEM block) and the remaining chain
 	certOnly, chainPEM := splitPEMChain(certPEM)
 	// Load the locally stored private key (agent keygen mode); a missing or unreadable key fails the job
 	keyPath := filepath.Join(a.config.KeyDir, job.CertificateID+".key")
 	var keyPEM string
 	keyData, err := os.ReadFile(keyPath)
 	if err != nil {
 		a.logger.Error("failed to read local private key for deployment",
 			"job_id", job.ID,
 			"key_path", keyPath,
 			"error", err)
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("key read failed: %v", err)); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "error", reportErr)
 		}
 		return
 	}
 	keyPEM = string(keyData)
 	a.logger.Info("loaded local private key for deployment",
 		"job_id", job.ID,
 		"key_path", keyPath)
 	// Deploy to the target using the appropriate connector
 	if job.TargetType != "" {
 		connector, err := a.createTargetConnector(ctx, job.TargetType, job.TargetConfig)
 		if err != nil {
 			a.logger.Error("failed to create target connector",
 				"job_id", job.ID,
 				"target_type", job.TargetType,
 				"error", err)
 			if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("connector init failed: %v", err)); reportErr != nil {
 				a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 			}
 			return
 		}
 		// Bundle 1 / RT-C1 closure (2026-05-12): defense in depth. The server
 		// runs internal/connector/target/configcheck.Validate on the way IN
 		// (Create/Update), and rejects shell metacharacters in command-bearing
 		// fields. Re-run the connector's full ValidateConfig here on the way
 		// OUT, before any DeployCertificate call. This catches (a) configs
 		// that pre-date the server-side guard, (b) corruption/tampering of
 		// the encrypted config blob, and (c) per-connector filesystem
 		// invariants (cert dir exists, paths writable) that the server can't
 		// check because the filesystem is on the agent host.
 		if err := connector.ValidateConfig(ctx, job.TargetConfig); err != nil {
 			a.logger.Error("connector config validation failed",
 				"job_id", job.ID,
 				"target_type", job.TargetType,
 				"error", err)
 			if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("%s config validation failed: %v", job.TargetType, err)); reportErr != nil {
 				a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 			}
 			return
 		}
 		deployReq := target.DeploymentRequest{
 			CertPEM:      certOnly,
 			KeyPEM:       keyPEM,
 			ChainPEM:     chainPEM,
 			TargetConfig: job.TargetConfig,
 			Metadata: map[string]string{
 				"certificate_id": job.CertificateID,
 				"job_id":         job.ID,
 			},
 		}
 		// Phase 2 of the deploy-hardening I master bundle:
 		// per-target deploy mutex. Acquire BEFORE
 		// DeployCertificate so two concurrent renewals against
 		// the same target ID serialize. The lock is held for the
 		// full Deploy duration including PreCommit (validate),
 		// PostCommit (reload), and post-deploy verify (Phases
 		// 4-9). Released on every return path via defer.
 		var targetID string
 		if job.TargetID != nil {
 			targetID = *job.TargetID
 		}
 		if mu := a.targetDeployMutex(targetID); mu != nil {
 			mu.Lock()
 			defer mu.Unlock()
 		}
 		result, err := connector.DeployCertificate(ctx, deployReq)
 		if err != nil {
 			a.logger.Error("deployment failed",
 				"job_id", job.ID,
 				"target_type", job.TargetType,
 				"error", err)
 			if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("deployment failed: %v", err)); reportErr != nil {
 				a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 			}
 			return
 		}
 		a.logger.Info("target connector deployment completed",
 			"job_id", job.ID,
 			"target_type", job.TargetType,
 			"success", result.Success,
 			"message", result.Message)
 		// If verification is enabled, verify the deployment by probing the live TLS endpoint
 		targetHost, targetPort, err := extractTargetHostAndPort(job.TargetConfig)
 		if err != nil {
 			a.logger.Warn("could not extract target host/port for verification",
 				"job_id", job.ID,
 				"error", err)
 		} else {
 			a.verifyAndReportDeployment(ctx, job, targetHost, targetPort, certOnly)
 		}
 	} else {
 		a.logger.Info("no target type specified, skipping connector invocation",
 			"job_id", job.ID)
 	}
 	// Report job as completed
 	if err := a.reportJobStatus(ctx, job.ID, "Completed", ""); err != nil {
 		a.logger.Error("failed to report job completed", "job_id", job.ID, "error", err)
 		return
 	}
 	a.logger.Info("deployment job completed", "job_id", job.ID)
 }
 // createTargetConnector instantiates the appropriate target connector based on type.
 // ctx is threaded into SDK-driven connectors (AWSACM, AzureKeyVault) so credential
 // resolution honors caller cancellation / deadlines instead of using a fresh
 // context.Background() (the contextcheck linter enforces this — the original Rank 5
 // implementation used Background() and tripped CI on commit 502823d).
 func (a *Agent) createTargetConnector(ctx context.Context, targetType string, configJSON json.RawMessage) (target.Connector, error) {
 	switch targetType {
 	case "NGINX":
 		var cfg nginx.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid NGINX config: %w", err)
 			}
 		}
 		return nginx.New(&cfg, a.logger), nil
 	case "Apache":
 		var cfg apache.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid Apache config: %w", err)
 			}
 		}
 		return apache.New(&cfg, a.logger), nil
 	case "HAProxy":
 		var cfg haproxy.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid HAProxy config: %w", err)
 			}
 		}
 		return haproxy.New(&cfg, a.logger), nil
 	case "F5":
 		var cfg f5.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid F5 config: %w", err)
 			}
 		}
 		conn, err := f5.New(&cfg, a.logger)
 		if err != nil {
 			return nil, fmt.Errorf("failed to create F5 connector: %w", err)
 		}
 		return conn, nil
 	case "IIS":
 		var cfg iis.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid IIS config: %w", err)
 			}
 		}
 		return iis.New(&cfg, a.logger)
 	case "Traefik":
 		var cfg traefik.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid Traefik config: %w", err)
 			}
 		}
 		return traefik.New(&cfg, a.logger), nil
 	case "Caddy":
 		var cfg caddy.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid Caddy config: %w", err)
 			}
 		}
 		return caddy.New(&cfg, a.logger), nil
 	case "Envoy":
 		var cfg envoy.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid Envoy config: %w", err)
 			}
 		}
 		return envoy.New(&cfg, a.logger), nil
 	case "Postfix":
 		var cfg pf.Config
 		cfg.Mode = "postfix"
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid Postfix config: %w", err)
 			}
 		}
 		return pf.New(&cfg, a.logger), nil
 	case "Dovecot":
 		var cfg pf.Config
 		cfg.Mode = "dovecot"
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid Dovecot config: %w", err)
 			}
 		}
 		return pf.New(&cfg, a.logger), nil
 	case "SSH":
 		var cfg sshconn.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid SSH config: %w", err)
 			}
 		}
 		return sshconn.New(&cfg, a.logger)
 	case "WinCertStore":
 		var cfg wcs.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid WinCertStore config: %w", err)
 			}
 		}
 		return wcs.New(&cfg, a.logger)
 	case "JavaKeystore":
 		var cfg jks.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid JavaKeystore config: %w", err)
 			}
 		}
 		return jks.New(&cfg, a.logger), nil
 	case "KubernetesSecrets":
 		var cfg k8s.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid KubernetesSecrets config: %w", err)
 			}
 		}
 		return k8s.New(&cfg, a.logger)
 	case "AWSACM":
 		// Rank 5 of the 2026-05-03 Infisical deep-research deliverable.
 		// AWS Certificate Manager target — SDK-driven (no file I/O).
 		// LoadDefaultConfig handles the standard AWS credential chain
 		// (IRSA / EC2 instance profile / SSO / env vars) without any
 		// long-lived creds in connector Config.
 		var cfg awsacm.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid AWSACM config: %w", err)
 			}
 		}
 		return awsacm.New(ctx, &cfg, a.logger)
 	case "AzureKeyVault":
 		// Rank 5 of the 2026-05-03 Infisical deep-research deliverable.
 		// Azure Key Vault target — SDK-driven (no file I/O).
 		// DefaultAzureCredential handles the standard Azure credential
 		// chain (managed identity / workload identity / env vars / az
 		// CLI fallback). Long-lived service-principal secrets are
 		// supported but discouraged via the credential_mode config.
 		var cfg azurekv.Config
 		if len(configJSON) > 0 {
 			if err := json.Unmarshal(configJSON, &cfg); err != nil {
 				return nil, fmt.Errorf("invalid AzureKeyVault config: %w", err)
 			}
 		}
 		return azurekv.New(ctx, &cfg, a.logger)
 	default:
 		return nil, fmt.Errorf("unsupported target type: %s", targetType)
 	}
 }
 // splitPEMChain splits a PEM chain into the first certificate (cert) and the rest (chain).
 // The control plane returns the full chain as a single string with PEM blocks concatenated.
 func splitPEMChain(pemChain string) (string, string) {
 	data := []byte(pemChain)
 	block, rest := pem.Decode(data)
 	if block == nil {
 		return pemChain, ""
 	}
 	cert := string(pem.EncodeToMemory(block))
 	// The remainder, trimmed of surrounding whitespace, is the chain; it is empty for a single cert
 	return cert, strings.TrimSpace(string(rest))
 }
 // fetchCertificate retrieves the certificate PEM chain from the control plane.
 // GET /api/v1/agents/{agentID}/certificates/{certID}
 func (a *Agent) fetchCertificate(ctx context.Context, certID string) (string, error) {
 	path := fmt.Sprintf("/api/v1/agents/%s/certificates/%s", a.config.AgentID, certID)
 	resp, err := a.makeRequest(ctx, http.MethodGet, path, nil)
 	if err != nil {
 		return "", fmt.Errorf("request failed: %w", err)
 	}
 	defer resp.Body.Close()
 	if resp.StatusCode != http.StatusOK {
 		body, _ := io.ReadAll(resp.Body)
 		return "", fmt.Errorf("server returned %d: %s", resp.StatusCode, string(body))
 	}
 	var certResp struct {
 		CertificatePEM string `json:"certificate_pem"`
 	}
 	if err := json.NewDecoder(resp.Body).Decode(&certResp); err != nil {
 		return "", fmt.Errorf("failed to decode response: %w", err)
 	}
 	return certResp.CertificatePEM, nil
 }
 // reportJobStatus reports the result of a job back to the control plane.
 // POST /api/v1/agents/{agentID}/jobs/{jobID}/status
 func (a *Agent) reportJobStatus(ctx context.Context, jobID string, status string, errorMsg string) error {
@@ -1067,239 +467,6 @@ func (a *Agent) makeRequest(ctx context.Context, method, path string, body inter
 	return resp, nil
 }
 // runDiscoveryScan walks configured directories, parses certificate files, and reports
 // discovered certificates to the control plane.
 // Supports PEM and DER encoded X.509 certificates.
 func (a *Agent) runDiscoveryScan(ctx context.Context) {
 	a.logger.Info("starting filesystem certificate discovery scan",
 		"directories", a.config.DiscoveryDirs)
 	startTime := time.Now()
 	var certs []discoveredCertEntry
 	var scanErrors []string
 	for _, dir := range a.config.DiscoveryDirs {
 		a.logger.Debug("scanning directory", "path", dir)
 		err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
 			if err != nil {
 				scanErrors = append(scanErrors, fmt.Sprintf("walk error at %s: %v", path, err))
 				return nil // continue walking
 			}
 			if info.IsDir() {
 				return nil
 			}
 			// Skip files larger than 1 MiB (unlikely to be a certificate)
 			if info.Size() > 1*1024*1024 {
 				return nil
 			}
 			// Check file extension
 			ext := strings.ToLower(filepath.Ext(path))
 			switch ext {
 			case ".pem", ".crt", ".cer", ".cert":
 				found := a.parsePEMFile(path)
 				certs = append(certs, found...)
 			case ".der":
 				if entry, err := a.parseDERFile(path); err == nil {
 					certs = append(certs, entry)
 				} else {
 					a.logger.Debug("skipping non-cert DER file", "path", path, "error", err)
 				}
 			default:
 				// Skip private-key files and extensionless files; attempt PEM
 				// parsing for any other unrecognized extension.
 				if ext == "" || ext == ".key" {
 					return nil
 				}
 				found := a.parsePEMFile(path)
 				if len(found) > 0 {
 					certs = append(certs, found...)
 				}
 			}
 			return nil
 		})
 		if err != nil {
 			scanErrors = append(scanErrors, fmt.Sprintf("failed to walk %s: %v", dir, err))
 		}
 	}
 	scanDuration := time.Since(startTime)
 	a.logger.Info("discovery scan completed",
 		"certificates_found", len(certs),
 		"errors", len(scanErrors),
 		"duration_ms", scanDuration.Milliseconds())
 	if len(certs) == 0 && len(scanErrors) == 0 {
 		a.logger.Debug("no certificates found and no errors, skipping report")
 		return
 	}
 	// Build report payload
 	entries := make([]map[string]interface{}, len(certs))
 	for i, c := range certs {
 		entries[i] = map[string]interface{}{
 			"fingerprint_sha256": c.FingerprintSHA256,
 			"common_name":        c.CommonName,
 			"sans":               c.SANs,
 			"serial_number":      c.SerialNumber,
 			"issuer_dn":          c.IssuerDN,
 			"subject_dn":         c.SubjectDN,
 			"not_before":         c.NotBefore,
 			"not_after":          c.NotAfter,
 			"key_algorithm":      c.KeyAlgorithm,
 			"key_size":           c.KeySize,
 			"is_ca":              c.IsCA,
 			"pem_data":           c.PEMData,
 			"source_path":        c.SourcePath,
 			"source_format":      c.SourceFormat,
 		}
 	}
 	report := map[string]interface{}{
 		"agent_id":         a.config.AgentID,
 		"directories":      a.config.DiscoveryDirs,
 		"certificates":     entries,
 		"errors":           scanErrors,
 		"scan_duration_ms": int(scanDuration.Milliseconds()),
 	}
 	// Submit to control plane
 	path := fmt.Sprintf("/api/v1/agents/%s/discoveries", a.config.AgentID)
 	resp, err := a.makeRequest(ctx, http.MethodPost, path, report)
 	if err != nil {
 		a.logger.Error("failed to submit discovery report", "error", err)
 		return
 	}
 	defer resp.Body.Close()
 	if resp.StatusCode != http.StatusAccepted {
 		body, _ := io.ReadAll(resp.Body)
 		a.logger.Error("discovery report rejected",
 			"status", resp.StatusCode,
 			"body", string(body))
 		return
 	}
 	a.logger.Info("discovery report submitted successfully",
 		"certificates", len(certs),
 		"errors", len(scanErrors))
 }
 // discoveredCertEntry holds parsed certificate metadata for reporting.
 type discoveredCertEntry struct {
 	FingerprintSHA256 string   `json:"fingerprint_sha256"`
 	CommonName        string   `json:"common_name"`
 	SANs              []string `json:"sans"`
 	SerialNumber      string   `json:"serial_number"`
 	IssuerDN          string   `json:"issuer_dn"`
 	SubjectDN         string   `json:"subject_dn"`
 	NotBefore         string   `json:"not_before"`
 	NotAfter          string   `json:"not_after"`
 	KeyAlgorithm      string   `json:"key_algorithm"`
 	KeySize           int      `json:"key_size"`
 	IsCA              bool     `json:"is_ca"`
 	PEMData           string   `json:"pem_data"`
 	SourcePath        string   `json:"source_path"`
 	SourceFormat      string   `json:"source_format"`
 }
 // parsePEMFile reads a file and extracts all X.509 certificates from PEM blocks.
 func (a *Agent) parsePEMFile(path string) []discoveredCertEntry {
 	data, err := os.ReadFile(path)
 	if err != nil {
 		a.logger.Debug("failed to read file", "path", path, "error", err)
 		return nil
 	}
 	var entries []discoveredCertEntry
 	rest := data
 	for {
 		var block *pem.Block
 		block, rest = pem.Decode(rest)
 		if block == nil {
 			break
 		}
 		if block.Type != "CERTIFICATE" {
 			continue
 		}
 		cert, err := x509.ParseCertificate(block.Bytes)
 		if err != nil {
 			a.logger.Debug("failed to parse certificate in PEM", "path", path, "error", err)
 			continue
 		}
 		pemStr := string(pem.EncodeToMemory(block))
 		entries = append(entries, certToEntry(cert, path, "PEM", pemStr))
 	}
 	return entries
 }
 // parseDERFile reads a DER-encoded certificate file.
 func (a *Agent) parseDERFile(path string) (discoveredCertEntry, error) {
 	data, err := os.ReadFile(path)
 	if err != nil {
 		return discoveredCertEntry{}, fmt.Errorf("read failed: %w", err)
 	}
 	cert, err := x509.ParseCertificate(data)
 	if err != nil {
 		return discoveredCertEntry{}, fmt.Errorf("parse failed: %w", err)
 	}
 	// Convert to PEM for storage
 	pemStr := string(pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: data}))
 	return certToEntry(cert, path, "DER", pemStr), nil
 }
 // certToEntry converts a parsed x509.Certificate into a discoveredCertEntry.
 func certToEntry(cert *x509.Certificate, path, format, pemData string) discoveredCertEntry {
 	// Compute SHA-256 fingerprint
 	fingerprint := fmt.Sprintf("%x", sha256Sum(cert.Raw))
 	// Determine key algorithm and size
 	keyAlg, keySize := certKeyInfo(cert)
 	return discoveredCertEntry{
 		FingerprintSHA256: fingerprint,
 		CommonName:        cert.Subject.CommonName,
 		SANs:              cert.DNSNames,
 		SerialNumber:      cert.SerialNumber.Text(16),
 		IssuerDN:          cert.Issuer.String(),
 		SubjectDN:         cert.Subject.String(),
 		NotBefore:         cert.NotBefore.UTC().Format(time.RFC3339),
 		NotAfter:          cert.NotAfter.UTC().Format(time.RFC3339),
 		KeyAlgorithm:      keyAlg,
 		KeySize:           keySize,
 		IsCA:              cert.IsCA,
 		PEMData:           pemData,
 		SourcePath:        path,
 		SourceFormat:      format,
 	}
 }
 // sha256Sum returns the SHA-256 hash of data.
 func sha256Sum(data []byte) [32]byte {
 	return sha256.Sum256(data)
 }
 // certKeyInfo extracts key algorithm name and size from a certificate.
 func certKeyInfo(cert *x509.Certificate) (string, int) {
 	switch pub := cert.PublicKey.(type) {
 	case *ecdsa.PublicKey:
 		return "ECDSA", pub.Curve.Params().BitSize
 	case *rsa.PublicKey:
 		return "RSA", pub.N.BitLen()
 	default:
 		switch cert.PublicKeyAlgorithm {
 		case x509.Ed25519:
 			return "Ed25519", 256
 		default:
 			return cert.PublicKeyAlgorithm.String(), 0
 		}
 	}
 }
 func main() {
 	// Parse command-line flags (with env var fallbacks for Docker deployment)
 	serverURL := flag.String("server", getEnvDefault("CERTCTL_SERVER_URL", "https://localhost:8443"), "Control plane server URL (must be https://)")
@@ -0,0 +1,291 @@
 // Copyright 2026 certctl LLC. All rights reserved.
 // SPDX-License-Identifier: BUSL-1.1
 package main
 import (
 	"context"
 	"crypto/ecdsa"
 	"crypto/elliptic"
 	"crypto/rand"
 	"crypto/x509"
 	"crypto/x509/pkix"
 	"encoding/json"
 	"encoding/pem"
 	"fmt"
 	"io"
 	"net/http"
 	"os"
 	"path/filepath"
 	"strings"
 )
 // Phase 9 ARCH-M2 closure Sprint 12 (2026-05-14): extracted from
 // cmd/agent/main.go via the Option B sibling-file pattern (mirrors
 // the Sprint 8 cmd/server cut). Package stays `main`; all methods
 // are still defined on *Agent so every call site continues to
 // resolve through Go's same-package method-set without any
 // import-path change.
 //
 // This file holds the WORK-POLLING entry point + CSR-job execution
 // — the inbound side of the agent's pull-only deployment model
 // (per CLAUDE.md "Pull-only deployment model" architecture
 // decision):
 //
 //   - pollForWork: queries GET /api/v1/agents/{id}/work each tick;
 //     dispatches each returned JobItem to the appropriate
 //     executor (CSR vs deployment).
 //   - executeCSRJob: handles AwaitingCSR jobs by generating an
 //     ECDSA P-256 key locally, persisting it to keyDir/<certID>.key
 //     with 0600 permissions (key NEVER leaves the agent — see
 //     CLAUDE.md "Agent-based key management"), creating the CSR,
 //     and POSTing it to the control plane for signing.
 //
 // The deployment-job executor lives in deploy.go alongside the
 // target connector factory + deploy-only helpers (splitPEMChain,
 // fetchCertificate). The discovery scan lives in discovery.go.
 // pollForWork queries the control plane for actionable jobs and processes them.
 // Jobs may be deployment jobs (Pending) or CSR jobs (AwaitingCSR).
 // GET /api/v1/agents/{agentID}/work
 func (a *Agent) pollForWork(ctx context.Context) {
 	a.logger.Debug("polling for work", "agent_id", a.config.AgentID)
 	path := fmt.Sprintf("/api/v1/agents/%s/work", a.config.AgentID)
 	resp, err := a.makeRequest(ctx, http.MethodGet, path, nil)
 	if err != nil {
 		a.logger.Error("work poll failed", "error", err)
 		a.consecutiveFailures++
 		return
 	}
 	defer resp.Body.Close()
 	// I-004: same terminal-retirement handling as sendHeartbeat. Work-poll is the
 	// other hot path that can observe an agent's soft-retirement; if the
 	// heartbeat tick happens to fire after a work-poll tick within the same
 	// retirement window, this branch catches it first. markRetired's sync.Once
 	// guards idempotency so racing both paths in the same tick only closes the
 	// signal channel once. No consecutiveFailures increment — retirement is
 	// not a transient failure.
 	if resp.StatusCode == http.StatusGone {
 		body, _ := io.ReadAll(resp.Body)
 		a.markRetired("work_poll", resp.StatusCode, string(body))
 		return
 	}
 	if resp.StatusCode != http.StatusOK {
 		body, _ := io.ReadAll(resp.Body)
 		a.logger.Error("work poll rejected",
 			"status", resp.StatusCode,
 			"body", string(body))
 		a.consecutiveFailures++
 		return
 	}
 	var workResp WorkResponse
 	if err := json.NewDecoder(resp.Body).Decode(&workResp); err != nil {
 		a.logger.Error("failed to decode work response", "error", err)
 		a.consecutiveFailures++
 		return
 	}
 	a.consecutiveFailures = 0
 	if workResp.Count == 0 {
 		a.logger.Debug("no pending work")
 		return
 	}
 	a.logger.Info("received work", "job_count", workResp.Count)
 	// Process each job based on type and status
 	for _, job := range workResp.Jobs {
 		switch {
 		case job.Status == "AwaitingCSR":
 			// Agent keygen mode: generate key locally, create CSR, submit to server
 			a.executeCSRJob(ctx, job)
 		case job.Type == "Deployment":
 			a.executeDeploymentJob(ctx, job)
 		}
 	}
 }
 // executeCSRJob handles an AwaitingCSR job: generates a private key locally, creates a CSR,
 // and submits it to the control plane for signing. The private key is stored on the local
 // filesystem with 0600 permissions and NEVER sent to the server.
 //
 // Flow:
 // 1. Generate ECDSA P-256 key pair
 // 2. Store private key to disk (keyDir/certID.key) with 0600 permissions
 // 3. Create CSR with common name and SANs from work response
 // 4. Submit CSR to control plane via POST /api/v1/agents/{id}/csr
 // 5. Server signs the CSR and creates a cert version + deployment jobs
 func (a *Agent) executeCSRJob(ctx context.Context, job JobItem) {
 	a.logger.Info("executing CSR job (agent-side key generation)",
 		"job_id", job.ID,
 		"certificate_id", job.CertificateID,
 		"common_name", job.CommonName)
 	// Step 1: Generate ECDSA P-256 key pair
 	privKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
 	if err != nil {
 		a.logger.Error("failed to generate private key",
 			"job_id", job.ID,
 			"error", err)
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("key generation failed: %v", err)); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 		}
 		return
 	}
 	a.logger.Info("generated ECDSA P-256 key pair locally",
 		"job_id", job.ID,
 		"certificate_id", job.CertificateID)
 	// Step 2: Store private key to disk with secure permissions.
 	//
 	// Bundle-9 / Audit L-002 + L-003: marshal+write through helpers that
 	// (a) zeroize the in-heap DER buffer immediately after the PEM block is
 	// constructed so the private scalar's exposure window is bounded by
 	// this function call, and (b) assert the key directory is mode 0700
 	// before any write touches disk. The PEM buffer is also defer-cleared:
 	// the encoded key is meant to reach disk, but a copy would otherwise
 	// linger on the heap after this function returns.
 	//
 	// SEC-002 closure (Sprint 1, 2026-05-16): safeAgentKeyPath validates
 	// the certificate_id shape AND asserts the joined path is contained
 	// within a.config.KeyDir. A crafted certificate_id like
 	// "../../etc/passwd" or "/abs/path" now fails closed before any
 	// disk I/O. See cmd/agent/keymem.go for the helper.
 	keyPath, kerr := safeAgentKeyPath(a.config.KeyDir, job.CertificateID)
 	if kerr != nil {
 		a.logger.Error("agent key path validation failed", "job_id", job.ID, "certificate_id", job.CertificateID, "error", kerr)
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("key path validation failed: %v", kerr)); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 		}
 		return
 	}
 	if err := ensureAgentKeyDirSecure(filepath.Dir(keyPath)); err != nil {
 		a.logger.Error("agent key dir hardening failed", "job_id", job.ID, "error", err)
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("key dir hardening failed: %v", err)); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 		}
 		return
 	}
 	var privKeyPEM []byte
 	if marshalErr := marshalAgentKeyAndZeroize(privKey, func(der []byte) error {
 		privKeyPEM = pem.EncodeToMemory(&pem.Block{
 			Type:  "EC PRIVATE KEY",
 			Bytes: der,
 		})
 		return nil
 	}); marshalErr != nil {
 		a.logger.Error("failed to marshal private key",
 			"job_id", job.ID,
 			"error", marshalErr)
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("key marshal failed: %v", marshalErr)); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 		}
 		return
 	}
 	defer clear(privKeyPEM)
 	if err := os.WriteFile(keyPath, privKeyPEM, 0600); err != nil {
 		a.logger.Error("failed to write private key to disk",
 			"job_id", job.ID,
 			"key_path", keyPath,
 			"error", err)
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("key storage failed: %v", err)); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 		}
 		return
 	}
 	a.logger.Info("private key stored securely",
 		"job_id", job.ID,
 		"key_path", keyPath,
 		"permissions", "0600")
 	// Validate common name is present
 	if job.CommonName == "" {
 		a.logger.Error("empty common name in CSR job", "job_id", job.ID)
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", "empty common name"); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "error", reportErr)
 		}
 		return
 	}
 	// Step 3: Create CSR with common name and SANs
 	// Split SANs into DNS names and email addresses for proper CSR encoding
 	var dnsNames []string
 	var emailAddresses []string
 	for _, san := range job.SANs {
 		if strings.Contains(san, "@") {
 			emailAddresses = append(emailAddresses, san)
 		} else {
 			dnsNames = append(dnsNames, san)
 		}
 	}
 	csrTemplate := &x509.CertificateRequest{
 		Subject: pkix.Name{
 			CommonName: job.CommonName,
 		},
 		DNSNames:       dnsNames,
 		EmailAddresses: emailAddresses,
 	}
 	csrDER, err := x509.CreateCertificateRequest(rand.Reader, csrTemplate, privKey)
 	if err != nil {
 		a.logger.Error("failed to create CSR",
 			"job_id", job.ID,
 			"error", err)
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("CSR creation failed: %v", err)); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 		}
 		return
 	}
 	csrPEM := string(pem.EncodeToMemory(&pem.Block{
 		Type:  "CERTIFICATE REQUEST",
 		Bytes: csrDER,
 	}))
 	// Step 4: Submit CSR to the control plane (only the public key leaves the agent)
 	a.logger.Info("submitting CSR to control plane",
 		"job_id", job.ID,
 		"certificate_id", job.CertificateID)
 	submitPath := fmt.Sprintf("/api/v1/agents/%s/csr", a.config.AgentID)
 	resp, err := a.makeRequest(ctx, http.MethodPost, submitPath, map[string]string{
 		"csr_pem":        csrPEM,
 		"certificate_id": job.CertificateID,
 	})
 	if err != nil {
 		a.logger.Error("failed to submit CSR",
 			"job_id", job.ID,
 			"error", err)
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("CSR submission failed: %v", err)); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 		}
 		return
 	}
 	defer resp.Body.Close()
 	if resp.StatusCode != http.StatusAccepted {
 		body, _ := io.ReadAll(resp.Body)
 		a.logger.Error("CSR submission rejected",
 			"job_id", job.ID,
 			"status", resp.StatusCode,
 			"body", string(body))
 		if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("CSR rejected: %s", string(body))); reportErr != nil {
 			a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
 		}
 		return
 	}
 	a.logger.Info("CSR submitted and signed successfully",
 		"job_id", job.ID,
 		"certificate_id", job.CertificateID,
 		"key_path", keyPath)
 }
@@ -256,6 +256,18 @@ func TestMain_ServerConfigFromEnvironment(t *testing.T) {
 	os.Setenv("CERTCTL_SERVER_PORT", "8080")
 	os.Setenv("CERTCTL_SERVER_TLS_CERT_PATH", certPath)
 	os.Setenv("CERTCTL_SERVER_TLS_KEY_PATH", keyPath)
 	// Acquisition-audit RED-003 closure (Sprint 5 ACQ, 2026-05-16):
 	// deny-empty default flipped to true; supply a placeholder token
 	// so Load() succeeds. The defer below restores prior env.
 	// LookupEnv distinguishes "unset" from "set to empty" so the restore
 	// below puts the environment back exactly as it was.
 	oldBootstrap, hadBootstrap := os.LookupEnv("CERTCTL_AGENT_BOOTSTRAP_TOKEN")
 	os.Setenv("CERTCTL_AGENT_BOOTSTRAP_TOKEN", "test-bootstrap-token-placeholder")
 	defer func() {
 		if hadBootstrap {
 			os.Setenv("CERTCTL_AGENT_BOOTSTRAP_TOKEN", oldBootstrap)
 		} else {
 			os.Unsetenv("CERTCTL_AGENT_BOOTSTRAP_TOKEN")
 		}
 	}()
 	cfg, err := config.Load()
 	if err != nil {
@@ -317,6 +329,18 @@ func TestMain_AuthTypeConfiguration(t *testing.T) {
 	// Set auth secret for api-key mode
 	os.Setenv("CERTCTL_AUTH_SECRET", "test-secret")
 	// Acquisition-audit RED-003 closure (Sprint 5 ACQ, 2026-05-16):
 	// deny-empty default flipped to true; supply a placeholder token
 	// so Load() succeeds.
 	// LookupEnv distinguishes "unset" from "set to empty" so the restore
 	// below puts the environment back exactly as it was.
 	oldBootstrap, hadBootstrap := os.LookupEnv("CERTCTL_AGENT_BOOTSTRAP_TOKEN")
 	os.Setenv("CERTCTL_AGENT_BOOTSTRAP_TOKEN", "test-bootstrap-token-placeholder")
 	defer func() {
 		if hadBootstrap {
 			os.Setenv("CERTCTL_AGENT_BOOTSTRAP_TOKEN", oldBootstrap)
 		} else {
 			os.Unsetenv("CERTCTL_AGENT_BOOTSTRAP_TOKEN")
 		}
 	}()
 	testCases := []string{"api-key", "none"}
@@ -645,3 +669,64 @@ func TestPreflightSCEPChallengePassword(t *testing.T) {
 		})
 	}
 }
 // =============================================================================
 // SEC-003 closure (Sprint 1, 2026-05-16). Pin that the rate-limit-enabled
 // middleware stack still emits the five security headers (HSTS, XFO,
 // nosniff, Referrer-Policy, CSP) that the default stack carries.
 //
 // Pre-fix the stack rebuild at main.go ~L2079 dropped
 // securityHeadersMiddleware so flipping CERTCTL_RATE_LIMIT_ENABLED=true
 // silently turned off five browser-side defenses. This test exercises
 // the same middleware composition main.go now builds when the flag is
 // on, and asserts each header lands on the wire. A future regression
 // that removes securityHeadersMiddleware (or reorders it after the
 // rate limiter such that a 429 response misses the headers) would
 // surface here.
 // =============================================================================
 func TestMain_RateLimitedStack_EmitsSecurityHeaders(t *testing.T) {
 	baseHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 		w.WriteHeader(http.StatusOK)
 	})
 	// Mirror the rate-limit-enabled middlewareStack from main.go.
 	rateLimiter := middleware.NewRateLimiter(middleware.RateLimitConfig{
 		RPS:       1000, // high enough that the single test request isn't dropped
 		BurstSize: 1000,
 	})
 	securityHeaders := middleware.SecurityHeaders(middleware.SecurityHeadersDefaults())
 	bodyLimit := middleware.NewBodyLimit(middleware.BodyLimitConfig{MaxBytes: 1 << 20})
 	stack := []func(http.Handler) http.Handler{
 		middleware.RequestID,
 		middleware.Recovery,
 		bodyLimit,
 		securityHeaders,
 		rateLimiter,
 		// Skip the CORS/auth/csrf/audit layers — they aren't relevant
 		// to the headers-on-response invariant we're pinning.
 	}
 	chained := middleware.Chain(baseHandler, stack...)
 	req := httptest.NewRequest(http.MethodGet, "/api/v1/test", nil)
 	w := httptest.NewRecorder()
 	chained.ServeHTTP(w, req)
 	if w.Code != http.StatusOK {
 		t.Fatalf("status = %d; want 200 (rate limit should not trip on a single request)", w.Code)
 	}
 	wantHeaders := map[string]string{
 		"Strict-Transport-Security": "max-age=31536000; includeSubDomains",
 		"X-Frame-Options":           "DENY",
 		"X-Content-Type-Options":    "nosniff",
 		"Referrer-Policy":           "no-referrer-when-downgrade",
 		"Content-Security-Policy":   "default-src 'self'; img-src 'self' data:; style-src 'self' 'unsafe-inline'; script-src 'self'; connect-src 'self'; frame-ancestors 'none'",
 	}
 	for name, want := range wantHeaders {
 		got := w.Header().Get(name)
 		if got != want {
 			t.Errorf("rate-limited stack: %s = %q; want %q", name, got, want)
 		}
 	}
 }
@@ -0,0 +1,209 @@
 // Copyright 2026 certctl LLC. All rights reserved.
 // SPDX-License-Identifier: BUSL-1.1
 package main
 import (
 	"database/sql"
 	"log/slog"
 	"os"
 	"strings"
 	"github.com/certctl-io/certctl/internal/config"
 	"github.com/certctl-io/certctl/internal/repository/postgres"
 )
 // Phase 9 ARCH-M2 closure Sprint 8b (2026-05-14): the deferred half of
 // Sprint 8. Extracts the boot-time migration handling from main()'s
 // inline body into two unexported helpers. Different shape from
 // Sprints 1-7 (data-type relocation) and from Sprint 8a (existing
 // helper-function relocation) — this sprint crosses the
 // behavior-change boundary Sprint 8 first identified.
 //
 // What lives here
 // ===============
 //   parseMigrateOnlyFlag() bool
 //     Hand-parses os.Args for `--migrate-only` (NOT flag.Parse — the
 //     server's config surface is otherwise env-var driven via
 //     config.Load; introducing flag.Parse's global state risks
 //     conflicting with other binaries that may import cmd/server later).
 //
 //   runBootMigrations(cfg, db, logger, migrateOnly) (exitNow bool)
 //     Owns the Phase 4 DEPL-M1 migration-via-hook posture: the
 //     migrationsViaHook env-var read, the RunMigrations + RunSeed
 //     gate, the --migrate-only early-exit signal, and the
 //     CERTCTL_DEMO_SEED demo-overlay branch.
 //
 //     Returns true ONLY when --migrate-only was set and migrations +
 //     seed completed cleanly. The caller (main) translates that to
 //     `return` rather than os.Exit(0) — which is the SOLE intentional
 //     behavior change in this sprint (see below).
 //
 // Behavior preservation contract
 // ==============================
 // Every error path inside runBootMigrations calls os.Exit(1)
 // directly, matching the original inline behavior byte-for-byte
 // (same log message, same exit code, same no-defer-run-on-fatal
 // semantics). The error-path os.Exit(1) is intentional: when
 // migration fails at boot, the server cannot recover, and bailing
 // out without running defers is the original Go-idiomatic shape.
 //
 // The ONE behavior change: the --migrate-only SUCCESS path now
 // returns to main() rather than calling os.Exit(0) inline. This
 // has one observable effect: the `defer db.Close()` registered in
 // main() now runs at clean exit instead of being skipped. That's
 // strictly better hygiene (clean DB connection shutdown vs OS
 // reclaim). The migration work is synchronous + complete before
 // the return; nothing async is left running that db.Close() could
 // truncate.
 //
 // All other paths — the migration log messages, the seed log
 // messages, the migrationsViaHook env-var read order, the
 // RunDemoSeed gating, the per-step success/skip log lines — are
 // byte-identical to the pre-Sprint-8b inline form. Verified via
 // `go test ./cmd/server/... -count=1 -short` (which runs the
 // existing main_test.go assertions through the new call site).
 //
 // Why this is a separate commit
 // =============================
 // Sprint 8a (commit 3f1344e8, listed below) extracted the bottom-of-file
 // helpers + adapter types — pure mechanical relocation that
 // couldn't change runtime semantics. Sprint 8b crosses the boundary
 // where mechanical relocation ends: introducing a new function
 // call frame changes defer scope, panic recovery, and (in this
 // case) the exit semantics for the --migrate-only path. The
 // Phase 9 prompt's "refactor is mechanical relocation; behavior
 // change is a separate concern" rule guards against exactly this
 // shape of risk being landed without a focused review.
 //
 // Splitting Sprint 8a (mechanical) from Sprint 8b (behavior-aware)
 // means the operator's git log shows:
 //   3f1344e8 ... wire.go         — no behavior change possible
 //   <this>   ... migrations.go    — one specific behavior shift,
 //                                   documented + intentional
 //
 // Anyone bisecting a future bug to one of these two commits gets a
 // clean "is it mechanical or did the behavior change" signal.
 // parseMigrateOnlyFlag scans os.Args for the `--migrate-only` token
 // and returns true if found. Hand-parsed instead of using flag.Parse
 // because:
 //
 //  1. The server's entire config surface is env-var driven via
 //     config.Load(). flag.Parse() introduces a global package-state
 //     dependency that future binaries importing cmd/server (test
 //     harnesses, CLI tools, embedded variants) would have to
 //     coordinate around.
 //  2. The only flag we care about is the migration-vs-server-lifecycle
 //     toggle; a hand-parser is 6 lines and has no transitive cost.
 //  3. The flag is Helm-pre-install-hook-facing (see
 //     deploy/helm/certctl/templates/migration-job.yaml). Its shape is
 //     pinned by that template, not by anything else; we don't need
 //     flag.Parse's auto-help generation or type coercion.
 //
 // Bare arg match — no `=` value form, no short alias, no override
 // from env. Anyone passing `--migrate-only` ANYWHERE in os.Args[1:]
 // flips the flag on. Matches the original inline behavior exactly.
 func parseMigrateOnlyFlag() bool {
 	for _, arg := range os.Args[1:] {
 		if arg == "--migrate-only" {
 			return true
 		}
 	}
 	return false
 }
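 // Illustrative sketch (not part of this commit): the same bare-token match
 // with the argument slice passed in, which makes the matching rules easy to
 // exercise without touching os.Args. hasMigrateOnly is a hypothetical name.

```go
package main

import (
	"fmt"
	"os"
)

// hasMigrateOnly (hypothetical) reports whether args contains the exact
// token "--migrate-only". No "=" value form and no short alias is accepted.
func hasMigrateOnly(args []string) bool {
	for _, arg := range args {
		if arg == "--migrate-only" {
			return true
		}
	}
	return false
}

func main() {
	// os.Args[0] is the program name, so only the tail is scanned.
	fmt.Println(hasMigrateOnly(os.Args[1:]))
}
```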
 // runBootMigrations owns the Phase 4 DEPL-M1 boot-time migration
 // posture. Three lifecycles to support:
 //
 //	(a) Compose / VM / bare-metal: server runs migrations at boot.
 //	    Default behavior — preserved unchanged.
 //	(b) Helm with pre-install/pre-upgrade hook: the migration Job
 //	    runs `certctl-server --migrate-only`, does its work, and
 //	    exits. The server Deployment's pods then start with
 //	    CERTCTL_MIGRATIONS_VIA_HOOK=true set; they see the env
 //	    var and skip their boot-time RunMigrations call so the
 //	    Job's work isn't duplicated.
 //	(c) Bare `certctl-server --migrate-only` invocation (e.g.
 //	    operator running a one-shot migration from the CLI):
 //	    runs migrations + seed and returns true so main returns
 //	    cleanly without starting the HTTP listener / scheduler /
 //	    signing setup.
 //
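 // migrateOnly captures case (c) and the hook Job in case (b);
 // CERTCTL_MIGRATIONS_VIA_HOOK captures the server pods in case (b),
 // which skip both passes. Every path that does run migrations goes
 // through the same RunMigrations + RunSeed code below.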
 //
 // Returns true ONLY when migrateOnly is set; caller (main) handles
 // the clean exit via `return` so deferred cleanup (db.Close) runs.
 // Returns false in every other case — caller continues normal boot.
 // On any migration / seed error: os.Exit(1) inline (matches the
 // pre-extraction shape; recovery is not possible at this boot
 // stage).
 func runBootMigrations(cfg *config.Config, db *sql.DB, logger *slog.Logger, migrateOnly bool) bool {
 	migrationsViaHook := strings.EqualFold(os.Getenv("CERTCTL_MIGRATIONS_VIA_HOOK"), "true")
 	if migrateOnly || !migrationsViaHook {
 		logger.Info("running migrations", "path", cfg.Database.MigrationsPath)
 		if err := postgres.RunMigrations(db, cfg.Database.MigrationsPath); err != nil {
 			logger.Error("failed to run migrations", "error", err)
 			os.Exit(1)
 		}
 		logger.Info("migrations completed")
 	} else {
 		logger.Info("skipping migrations at boot (CERTCTL_MIGRATIONS_VIA_HOOK=true — Helm pre-install/pre-upgrade hook owns this work)")
 	}
 	// Apply baseline seed data.
 	//
 	// U-3 (P1, cat-u-seed_initdb_schema_drift): pre-U-3 seed.sql was mounted
 	// into postgres `/docker-entrypoint-initdb.d/` alongside a hand-curated
 	// subset of migrations. Adding a migration that introduced a new column
 	// referenced by seed.sql (cat-o-retry_interval_unit_mismatch /
 	// policy_rules.severity / etc.) without also updating the compose volume
 	// mounts caused initdb to crash on first up. Post-U-3 the compose stack
 	// drops all initdb mounts; postgres comes up with empty schema, the
 	// server runs RunMigrations above, then this RunSeed call lands the
 	// baseline data — all from a single source of truth (this binary).
 	// See internal/repository/postgres/db.go::RunSeed for the contract.
 	//
 	// Phase 4 DEPL-M1: same migration-via-hook gating as RunMigrations.
 	// When the hook owns migrations it also owns the seed pass.
 	if migrateOnly || !migrationsViaHook {
 		logger.Info("applying baseline seed", "path", cfg.Database.MigrationsPath)
 		if err := postgres.RunSeed(db, cfg.Database.MigrationsPath); err != nil {
 			logger.Error("failed to apply seed data", "error", err)
 			os.Exit(1)
 		}
 		logger.Info("seed completed")
 	} else {
 		logger.Info("skipping baseline seed at boot (CERTCTL_MIGRATIONS_VIA_HOOK=true — hook applies seed alongside migrations)")
 	}
 	// Phase 4 DEPL-M1: --migrate-only early-exit. Migrations + seed are
 	// done; the operator only asked for the migration pass. Signal main
 	// to return cleanly so deferred db.Close runs (Sprint 8b improvement
 	// over the pre-extraction os.Exit(0) which skipped defers).
 	if migrateOnly {
 		logger.Info("--migrate-only: migrations + seed complete; exiting without starting server lifecycle")
 		return true
 	}
 	// Apply demo overlay seed when CERTCTL_DEMO_SEED=true. Pre-U-3 the demo
 	// overlay (deploy/docker-compose.demo.yml) mounted seed_demo.sql into
 	// postgres `/docker-entrypoint-initdb.d/`; that broke once U-3 dropped
 	// the initdb migration mounts (the demo seed references tables that
 	// wouldn't exist at initdb time). The runtime path here is the
 	// post-U-3 replacement. Default-off so a vanilla deploy never lands
 	// fake-history rows. See postgres.RunDemoSeed for the contract.
 	if cfg.Database.DemoSeed {
 		logger.Info("applying demo seed (CERTCTL_DEMO_SEED=true)", "path", cfg.Database.MigrationsPath)
 		if err := postgres.RunDemoSeed(db, cfg.Database.MigrationsPath); err != nil {
 			logger.Error("failed to apply demo seed data", "error", err)
 			os.Exit(1)
 		}
 		logger.Info("demo seed completed")
 	}
 	return false
 }
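 // Illustrative sketch (not part of this commit): the branch structure of
 // runBootMigrations reduced to a pure function, so the combinations of
 // --migrate-only, CERTCTL_MIGRATIONS_VIA_HOOK and CERTCTL_DEMO_SEED can be
 // read off as a table. bootPlan and planBoot are hypothetical names; nothing
 // here touches a database.

```go
package main

import "fmt"

// bootPlan (hypothetical) summarizes what one call would do.
type bootPlan struct {
	runMigrationsAndSeed bool // RunMigrations + RunSeed execute
	runDemoSeed          bool // RunDemoSeed executes
	exitNow              bool // main returns without starting the server
}

// planBoot mirrors the conditions in runBootMigrations: migrations and the
// baseline seed share one gate, --migrate-only returns before the demo
// seed, and the demo seed itself is not gated on the hook variable.
func planBoot(migrateOnly, viaHook, demoSeed bool) bootPlan {
	p := bootPlan{runMigrationsAndSeed: migrateOnly || !viaHook}
	if migrateOnly {
		p.exitNow = true
		return p
	}
	p.runDemoSeed = demoSeed
	return p
}

func main() {
	for _, migrateOnly := range []bool{false, true} {
		for _, viaHook := range []bool{false, true} {
			fmt.Printf("migrateOnly=%t viaHook=%t demoSeed=true -> %+v\n",
				migrateOnly, viaHook, planBoot(migrateOnly, viaHook, true))
		}
	}
}
```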
@@ -0,0 +1,758 @@
 // Copyright 2026 certctl LLC. All rights reserved.
 // SPDX-License-Identifier: BUSL-1.1
 package main
 import (
 	"context"
 	"crypto"
 	"crypto/tls"
 	"crypto/x509"
 	"encoding/pem"
 	"fmt"
 	"log/slog"
 	"net/http"
 	"os"
 	"strings"
 	"time"
 	"github.com/certctl-io/certctl/internal/api/handler"
 	oidcdomain "github.com/certctl-io/certctl/internal/auth/oidc/domain"
 	"github.com/certctl-io/certctl/internal/auth/session"
 	userdomain "github.com/certctl-io/certctl/internal/auth/user/domain"
 	"github.com/certctl-io/certctl/internal/domain"
 	authdomainAlias "github.com/certctl-io/certctl/internal/domain/auth"
 	"github.com/certctl-io/certctl/internal/repository"
 	"github.com/certctl-io/certctl/internal/repository/postgres"
 	"github.com/certctl-io/certctl/internal/scep/intune"
 	"github.com/certctl-io/certctl/internal/service"
 	authsvc "github.com/certctl-io/certctl/internal/service/auth"
 	"github.com/certctl-io/certctl/internal/trustanchor"
 )
 // Phase 9 ARCH-M2 closure Sprint 8 (2026-05-14): extracted from
 // cmd/server/main.go. Different shape from the config.go cuts —
 // the move is by FUNCTIONAL CONCERN (boot-time preflight + DI
 // adapter wiring), not by TYPE FAMILY.
 //
 // Sprint 8 ships TWO of the three files the Phase 9 prompt names:
 //   - main.go      — entrypoint (unchanged; what's left after the cut)
 //   - wire.go      — this file (DI assembly: preflight helpers +
 //                    adapter types that bridge package boundaries)
 //
 // The third file the prompt names — migrations.go — is NOT in this
 // commit. See "What's NOT in this sprint" below for the deferral
 // rationale.
 //
 // What lives here
 // ===============
 // Eight preflight + DI helper functions:
 //   - preflightSCEPChallengePassword   (H-2 fix: SCEP needs non-empty
 //                                       shared secret if enabled)
 //   - preflightSCEPMTLSTrustBundle     (SCEP Phase 6.5: per-profile
 //                                       mTLS CA bundle validation)
 //   - preflightESTMTLSClientCATrustBundle (EST Phase 2.5: same shape,
 //                                       returns SIGHUP-reloadable
 //                                       *trustanchor.Holder)
 //   - preflightSCEPIntuneTrustAnchor   (SCEP Phase 8.2: Intune
 //                                       Connector signing-cert bundle)
 //   - loadSCEPRAPair                   (post-preflight cert+key load)
 //   - preflightSCEPRACertKey           (RA cert/key validation: file
 //                                       mode 0600, cert+key match,
 //                                       NotAfter, RSA-or-ECDSA alg)
 //   - preflightEnrollmentIssuer        (L-005: EST/SCEP issuer can
 //                                       serve GetCACertPEM)
 //   - buildFinalHandler                (M-001 option D: HTTP dispatch
 //                                       wrapper routing
 //                                       authenticated vs no-auth
 //                                       chains by URL prefix)
 //
 // Five adapter types that bridge package boundaries (avoid import
 // cycles between internal/auth, internal/service/auth,
 // internal/api/handler, internal/auth/oidc, internal/auth/session,
 // internal/auth/breakglass):
 //   - authPermissionCheckerAdapter      (typed-string → plain-string
 //                                        auth.PermissionChecker
 //                                        interface)
 //   - authCheckResolverAdapter          (postgres ActorRoleRepository
 //                                        → handler.AuthCheckResolver)
 //   - sessionMinterAdapter              (session.Service → OIDC
 //                                        SessionMinter port)
 //   - breakglassSessionMinterAdapter    (session.Service → breakglass
 //                                        SessionMinter port + audit
 //                                        2026-05-10 HIGH-1 revoke-all)
 //   - oidcProvidersListAdapter          (postgres OIDCProviderRepository
 //                                        → handler.OIDCProvidersListResolver
 //                                        with MED-9 enabled-filter)
 //
 // Plus the silenceUnusedImports var-block that pins
 // oidcdomain.OIDCProvider as a load-bearing reference (the adapter
 // types use *userdomain.User and repository.OIDCProviderRepository
 // indirectly; oidcdomain.OIDCProvider isn't named in any function
 // signature here but is part of the Phase 3 SessionMinter contract).
 //
 // What's NOT in this sprint (and why)
 // ===================================
 // migrations.go is deferred. The Phase 9 prompt asks for three files:
 // main.go (entrypoint) + wire.go (this file) + migrations.go (boot-
 // time migration handling). The migration code (Phase 4 DEPL-M1
 // --migrate-only flag handling + RunMigrations + RunSeed call +
 // CERTCTL_MIGRATIONS_VIA_HOOK gating) lives INLINE inside the 2300-
 // line main() function — lines ~59-264 in the original — not as a
 // standalone helper.
 //
 // Extracting it into a migrations.go would require:
 //   1. Creating a new unexported function (e.g.,
 //      runMigrations(ctx, cfg, db, logger) error) that consolidates
 //      lines ~71-77 (--migrate-only parse) + ~199-248 (the migration
 //      branch + --migrate-only early-exit) + ~250-264 (the demo
 //      overlay seed branch).
 //   2. Replacing the inline block in main() with a single call.
 //   3. Threading the early-exit semantics out (os.Exit(0) vs return
 //      "migration done" sentinel error vs a third option) so main's
 //      defer ordering doesn't change.
 //
 // That's behavior-change territory — a new function call frame, a
 // new defer scope, error-handling pattern shift. Different risk
 // shape from the pure-data type relocations Sprints 1-7 did. The
 // Phase 9 prompt says "Do NOT change exported type signatures; the
 // refactor is mechanical relocation; behavior change is a separate
 // concern." Extracting an inline block from main() into a new
 // function is the same shape of risk that rule was guarding against.
 //
 // Recommended path for the migrations.go cut:
 //   - Land it as a separate, smaller PR with its own review focus
 //     (the runMigrations function shape, the early-exit semantics,
 //     unit tests for the new function via the existing main_test.go
 //     fixture). The infrastructure for the PR exists today; only
 //     the operator's go-ahead on the behavior-change risk is needed.
 //   - Estimated impact: another ~80-120 LOC out of main.go (the
 //     migration + seed + early-exit block) into a new migrations.go.
 //   - Phase 4's --migrate-only code path already runs through this
 //     code section, so the extracted function should reproduce that
 //     exact flow without behavior change beyond the call-frame
 //     introduction.
 //
 // Public-surface invariant
 // ========================
 // The moved helpers + adapter types are all in package `main`
 // (which Go cannot expose to external importers). No exported
 // surface changes. The reorganization is invisible outside
 // cmd/server/. Same-package callers in main.go (preflight*
 // invocations, adapter instantiation) resolve via the package
 // symbol table without modification.
 // preflightSCEPChallengePassword enforces the H-2 fix: if SCEP is enabled, a
 // non-empty challenge password MUST be configured. Returns a non-nil error
 // otherwise so the caller can refuse to start the control plane (CWE-306,
 // missing authentication for a critical function).
 //
 // This helper is extracted so the check can be unit tested without booting
 // the full server. The caller (main) is responsible for translating the
 // returned error into a structured log line and os.Exit(1).
 func preflightSCEPChallengePassword(enabled bool, challengePassword string) error {
 	if !enabled {
 		return nil
 	}
 	if challengePassword == "" {
 		return fmt.Errorf("SCEP enabled but CERTCTL_SCEP_CHALLENGE_PASSWORD is empty: " +
 			"SCEP enrollment would accept any client (CWE-306); " +
 			"configure a non-empty shared secret or set CERTCTL_SCEP_ENABLED=false")
 	}
 	return nil
 }
 // preflightSCEPMTLSTrustBundle validates a per-profile mTLS client-CA
 // trust bundle. SCEP RFC 8894 + Intune master bundle Phase 6.5.
 //
 // Mirrors preflightSCEPRACertKey's no-op-when-disabled pattern; otherwise
 // the checks are:
 //
 //  1. Path is non-empty (the Validate() refuse covers this too, but
 //     preflight reports the specific failure with an actionable error
 //     string + os.Exit(1) at the call site).
 //  2. File exists + readable.
 //  3. PEM-decodes to ≥1 CERTIFICATE block.
 //  4. None of the bundled certs is past NotAfter — an expired trust
 //     anchor would silently reject every client cert at runtime.
 //
 // On success, returns the parsed *x509.CertPool ready to inject into the
 // per-profile SCEPHandler via SetMTLSTrustPool. Each bundled cert also
 // contributes to the union pool that backs the TLS-layer
 // VerifyClientCertIfGiven.
 func preflightSCEPMTLSTrustBundle(enabled bool, bundlePath string) (*x509.CertPool, error) {
 	if !enabled {
 		return nil, nil
 	}
 	if bundlePath == "" {
 		return nil, fmt.Errorf("MTLS enabled but trust bundle path empty: " +
 			"set CERTCTL_SCEP_PROFILE_<NAME>_MTLS_CLIENT_CA_TRUST_BUNDLE_PATH to a PEM file " +
 			"containing the bootstrap-CA certs the operator allows to enroll")
 	}
 	body, err := os.ReadFile(bundlePath)
 	if err != nil {
 		return nil, fmt.Errorf("read MTLS trust bundle: %w (path=%s)", err, bundlePath)
 	}
 	pool := x509.NewCertPool()
 	rest := body
 	count := 0
 	now := time.Now()
 	for {
 		var block *pem.Block
 		block, rest = pem.Decode(rest)
 		if block == nil {
 			break
 		}
 		if block.Type != "CERTIFICATE" {
 			continue
 		}
 		cert, err := x509.ParseCertificate(block.Bytes)
 		if err != nil {
 			return nil, fmt.Errorf("parse MTLS trust bundle cert: %w (path=%s)", err, bundlePath)
 		}
 		if now.After(cert.NotAfter) {
 			return nil, fmt.Errorf("MTLS trust bundle cert expired at %s (subject=%q, path=%s) — replace before restart",
 				cert.NotAfter.Format(time.RFC3339), cert.Subject.CommonName, bundlePath)
 		}
 		pool.AddCert(cert)
 		count++
 	}
 	if count == 0 {
 		return nil, fmt.Errorf("MTLS trust bundle contained no CERTIFICATE PEM blocks (path=%s)", bundlePath)
 	}
 	return pool, nil
 }
 // preflightESTMTLSClientCATrustBundle validates a per-profile EST mTLS
 // client-CA trust bundle and returns a SIGHUP-reloadable holder.
 //
 // EST RFC 7030 hardening master bundle Phase 2.5.
 //
 // Mirrors preflightSCEPMTLSTrustBundle's checks (file exists, parses as
 // PEM, ≥1 cert, none expired) but returns a *trustanchor.Holder rather
 // than a raw *x509.CertPool — the EST handler stores the holder so a
 // SIGHUP rotates the trust bundle live without a server restart, exactly
 // the way the Intune trust anchor rotation works (Phase 8.5 of the SCEP
 // bundle). The handler-side .Pool() accessor on the holder rebuilds an
 // x509.CertPool from the current snapshot for each Verify call.
 //
 // Uses the shared internal/trustanchor.LoadBundle (extracted in EST
 // hardening Phase 2.1 from the original Intune-only path) so the EST
 // + Intune callers exercise the same loader semantics — empty bundle
 // rejected, expired cert rejected with subject in error message,
 // non-CERTIFICATE PEM blocks tolerated.
 func preflightESTMTLSClientCATrustBundle(enabled bool, pathID, bundlePath string, logger *slog.Logger) (*trustanchor.Holder, error) {
 	if !enabled {
 		return nil, nil
 	}
 	if bundlePath == "" {
 		return nil, fmt.Errorf("EST profile (PathID=%q) MTLS enabled but trust bundle path empty: "+
 			"set CERTCTL_EST_PROFILE_<NAME>_MTLS_CLIENT_CA_TRUST_BUNDLE_PATH to a PEM file "+
 			"containing the bootstrap-CA certs the operator allows to enroll", pathID)
 	}
 	holder, err := trustanchor.New(bundlePath, logger)
 	if err != nil {
 		return nil, fmt.Errorf("EST profile (PathID=%q) MTLS trust bundle preflight: %w", pathID, err)
 	}
 	holder.SetLabelForLog(fmt.Sprintf("EST mTLS client CA bundle (PathID=%q)", pathID))
 	return holder, nil
 }
 // preflightSCEPIntuneTrustAnchor validates a per-profile Microsoft Intune
 // Certificate Connector signing-cert trust bundle.
 //
 // SCEP RFC 8894 + Intune master bundle Phase 8.2.
 //
 // No-op when this profile has Intune disabled (the common case for
 // non-Intune SCEP deploys). When enabled:
 //
 //  1. Path is non-empty (Validate() refuse covers this too; we re-check
 //     here so the caller can os.Exit(1) with the specific PathID in the
 //     log line).
 //  2. File exists + readable.
 //  3. PEM-decodes to ≥1 CERTIFICATE block (intune.LoadTrustAnchor enforces
 //     this and skips non-CERTIFICATE blocks like accidentally-pasted
 //     priv-key blocks).
 //  4. None of the bundled certs is past NotAfter — an expired Intune
 //     trust anchor would silently reject every Connector challenge at
 //     runtime, which is a much worse failure mode than failing fast at
 //     boot. intune.LoadTrustAnchor enforces this and surfaces the subject
 //     CN in the error message so the operator knows which cert to rotate.
 //
 // On success returns the freshly-built *intune.TrustAnchorHolder ready to
 // inject into the per-profile SCEPService via SetIntuneIntegration. The
 // holder also installs the SIGHUP watcher (started by the caller).
 func preflightSCEPIntuneTrustAnchor(enabled bool, pathID, path string, logger *slog.Logger) (*intune.TrustAnchorHolder, error) {
 	if !enabled {
 		return nil, nil
 	}
 	// pathIDLabel renders the empty-string PathID as "<root>" so the
 	// operator's boot-log error doesn't read like a missing variable.
 	pathIDLabel := pathID
 	if pathIDLabel == "" {
 		pathIDLabel = "<root>"
 	}
 	if path == "" {
 		return nil, fmt.Errorf("SCEP profile (PathID=%q) INTUNE enabled but trust anchor path empty: "+
 			"set CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CONNECTOR_CERT_PATH to a PEM bundle "+
 			"of the Microsoft Intune Certificate Connector's signing certs", pathIDLabel)
 	}
 	holder, err := intune.NewTrustAnchorHolder(path, logger)
 	if err != nil {
 		return nil, fmt.Errorf("SCEP profile (PathID=%q) INTUNE trust anchor load failed: %w (path=%s)", pathIDLabel, err, path)
 	}
 	return holder, nil
 }
 // loadSCEPRAPair reads the RA cert PEM + key PEM and returns the parsed
 // x509.Certificate + crypto.PrivateKey ready for the SCEP handler's RFC
 // 8894 path. Called AFTER preflightSCEPRACertKey passed; failures here
 // indicate a TOCTOU race or a filesystem change between preflight and
 // the load (rare).
 //
 // Cert PEM may carry a chain (CA + RA + intermediate); we use the FIRST
 // CERTIFICATE block, matching the RFC 8894 §3.5.1 single-cert convention
 // for the GetCACert response.
 func loadSCEPRAPair(certPath, keyPath string) (*x509.Certificate, crypto.PrivateKey, error) {
 	certPEM, err := os.ReadFile(certPath)
 	if err != nil {
 		return nil, nil, fmt.Errorf("read RA cert: %w", err)
 	}
 	keyPEM, err := os.ReadFile(keyPath)
 	if err != nil {
 		return nil, nil, fmt.Errorf("read RA key: %w", err)
 	}
 	pair, err := tls.X509KeyPair(certPEM, keyPEM)
 	if err != nil {
 		return nil, nil, fmt.Errorf("parse RA pair: %w", err)
 	}
 	if len(pair.Certificate) == 0 {
 		return nil, nil, fmt.Errorf("RA cert PEM contained no certificate blocks")
 	}
 	leaf, err := x509.ParseCertificate(pair.Certificate[0])
 	if err != nil {
 		return nil, nil, fmt.Errorf("parse RA cert: %w", err)
 	}
 	return leaf, pair.PrivateKey, nil
 }
 // preflightSCEPRACertKey validates the RA cert/key pair the RFC 8894 SCEP
 // path requires. Mirrors preflightSCEPChallengePassword's no-op-when-disabled
 // pattern; otherwise the checks are:
 //
 //  1. Both paths are non-empty (the Validate() refuse covers this too,
 //     but preflight reports the specific failure mode + os.Exit(1) so the
 //     operator sees a clear log line in addition to the config error).
 //  2. The key file mode is 0600 (refuse world-/group-readable RA key —
 //     defense-in-depth against credential leak via a misconfigured
 //     deploy that leaves /etc/certctl/scep/*.key as 0644).
 //  3. Cert PEM parses to exactly one x509.Certificate.
 //  4. Key PEM parses to a Go crypto.Signer (RSA or ECDSA — RFC 8894
 //     §3.5.2 advertises those as the CMS-compatible algorithms).
 //  5. The cert's PublicKey matches the key's Public() — refuses pairs
 //     accidentally swapped between profiles in a multi-profile config.
 //  6. The cert's NotAfter is in the future: an expired RA cert would
 //     cause conformant clients to reject the CertRep signature.
 //
 // Each check returns a wrapped error; the caller (main) is responsible for
 // translating to a structured slog.Error + os.Exit(1) so the helper stays
 // unit-testable without booting the full server.
 func preflightSCEPRACertKey(enabled bool, raCertPath, raKeyPath string) error {
 	if !enabled {
 		return nil
 	}
 	if raCertPath == "" || raKeyPath == "" {
 		return fmt.Errorf("SCEP enabled but RA pair missing: " +
 			"set CERTCTL_SCEP_RA_CERT_PATH + CERTCTL_SCEP_RA_KEY_PATH " +
 			"(RFC 8894 §3.2.2 requires an RA pair so clients can encrypt the " +
 			"CSR to the RA cert and the server can sign the CertRep response)")
 	}
 	// File mode check FIRST so a world-readable key never gets read into the
 	// process address space. Only meaningful on POSIX hosts (on Windows,
 	// Stat().Mode() doesn't carry POSIX permission bits); the production
 	// deploy is Linux per the Dockerfile.
 	keyInfo, err := os.Stat(raKeyPath)
 	if err != nil {
 		return fmt.Errorf("CERTCTL_SCEP_RA_KEY_PATH stat failed: %w (path=%s)", err, raKeyPath)
 	}
 	mode := keyInfo.Mode().Perm()
 	if mode&0o077 != 0 {
 		return fmt.Errorf("CERTCTL_SCEP_RA_KEY_PATH has insecure permissions %#o; "+
 			"RA private key must be mode 0600 (owner read/write only) — "+
 			"chmod 0600 %s and restart", mode, raKeyPath)
 	}
 	certPEM, err := os.ReadFile(raCertPath)
 	if err != nil {
 		return fmt.Errorf("CERTCTL_SCEP_RA_CERT_PATH read failed: %w (path=%s)", err, raCertPath)
 	}
 	keyPEM, err := os.ReadFile(raKeyPath)
 	if err != nil {
 		return fmt.Errorf("CERTCTL_SCEP_RA_KEY_PATH read failed: %w (path=%s)", err, raKeyPath)
 	}
 	// tls.X509KeyPair validates that the cert + key parse, share an algorithm,
 	// and the cert's PublicKey matches the key's Public() — three of our six
 	// checks in a single stdlib call, so we use it rather than re-implementing.
 	pair, err := tls.X509KeyPair(certPEM, keyPEM)
 	if err != nil {
 		return fmt.Errorf("RA cert/key pair invalid: %w "+
 			"(cert=%s key=%s) — verify the cert and key are matching halves of "+
 			"the same RA pair, both PEM-encoded, with the cert containing exactly "+
 			"one CERTIFICATE block and the key containing one PRIVATE KEY block",
 			err, raCertPath, raKeyPath)
 	}
 	if len(pair.Certificate) == 0 {
 		// Defensive — tls.X509KeyPair already errors on this, but the contract
 		// for the next x509.ParseCertificate call needs the slice non-empty.
 		return fmt.Errorf("RA cert PEM at %s contains no certificate blocks", raCertPath)
 	}
 	// Re-parse the leaf so we can read NotAfter + the public-key alg.
 	leaf, err := x509.ParseCertificate(pair.Certificate[0])
 	if err != nil {
 		return fmt.Errorf("RA cert at %s does not parse as x509: %w", raCertPath, err)
 	}
 	if time.Now().After(leaf.NotAfter) {
 		return fmt.Errorf("RA cert at %s expired at %s — "+
 			"generate a fresh RA pair (the SCEP CertRep signature would be "+
 			"rejected by every conformant client)", raCertPath, leaf.NotAfter.Format(time.RFC3339))
 	}
 	// CMS-compatible public-key algorithm gate. RFC 8894 §3.5.2 advertises RSA
 	// and AES; the responder cert algorithm pertains to the signature scheme
 	// used on the CertRep, which means the cert's PublicKey must be RSA or
 	// ECDSA. Catches pre-shared Ed25519 dev keys that micromdm/scep clients
 	// reject.
 	switch leaf.PublicKeyAlgorithm {
 	case x509.RSA, x509.ECDSA:
 		// ok — supported by golang.org/x/crypto/ocsp + every SCEP client
 	default:
 		return fmt.Errorf("RA cert at %s uses unsupported public-key algorithm %s — "+
 			"RFC 8894 §3.5.2 CMS signing requires RSA or ECDSA",
 			raCertPath, leaf.PublicKeyAlgorithm)
 	}
 	return nil
 }
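
The permission gate in check 2 can be exercised in isolation. The sketch below is illustrative only: `keyModeTooOpen` is a hypothetical helper name, not a function in this file. It applies the same `0o077` mask as the preflight, which rejects any group or other permission bit. A mode of 0400 therefore passes as well, so the "must be mode 0600" wording in the error message is the recommended value rather than the only accepted one.

```go
package main

import (
	"fmt"
	"io/fs"
)

// keyModeTooOpen reports whether a private-key file mode grants any
// permission bit to group or other. Hypothetical helper; it applies the
// same 0o077 mask the preflight uses.
func keyModeTooOpen(mode fs.FileMode) bool {
	return mode.Perm()&0o077 != 0
}

func main() {
	// Two owner-only modes followed by two modes that leak to group/other.
	for _, m := range []fs.FileMode{0o600, 0o400, 0o640, 0o644} {
		fmt.Printf("%#o tooOpen=%v\n", m, keyModeTooOpen(m))
	}
}
```

Keeping the mask in a pure function like this would let a unit test cover the permission logic without touching the filesystem.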
 // preflightEnrollmentIssuer validates at startup that an EST/SCEP-bound issuer
 // can actually serve a CA certificate. This closes audit finding L-005:
 // pre-Bundle-4 the EST/SCEP startup path verified the issuer existed in the
 // registry but did not verify the issuer TYPE could emit a CA cert. An
 // operator who bound CERTCTL_EST_ISSUER_ID to an ACME issuer (which does
 // not have a static CA cert — see internal/connector/issuer/acme/acme.go::
 // GetCACertPEM returning an explicit error) would boot successfully and
 // only see failures at the first /est/cacerts request, hiding the misconfig
 // for hours/days behind a degraded enrollment surface.
 //
 // Strategy: call issuerConn.GetCACertPEM(ctx) at startup with a short
 // timeout. If the issuer can serve a CA cert (local, vault, openssl,
 // stepca, awsacmpca, etc.), the call succeeds and we proceed. If not
 // (acme, digicert, sectigo, entrust, googlecas, ejbca, globalsign — most
 // vendor-CA issuers that hand back chains per-issuance), the call fails
 // loudly with the connector's own error string, and the caller os.Exit(1)s.
 //
 // Returns nil on success, non-nil error suitable for structured logging
 // + os.Exit(1) by the caller. Caller is responsible for the timeout context.
 func preflightEnrollmentIssuer(ctx context.Context, protocol, issuerID string, issuerConn service.IssuerConnector) error {
 	if issuerConn == nil {
 		return fmt.Errorf("%s issuer %q: connector is nil", protocol, issuerID)
 	}
 	caCertPEM, err := issuerConn.GetCACertPEM(ctx)
 	if err != nil {
 		return fmt.Errorf("%s issuer %q: cannot serve CA certificate (%w); "+
 			"choose an issuer type that exposes a static CA chain "+
 			"(local / vault / openssl / stepca / awsacmpca) or disable %s",
 			protocol, issuerID, err, protocol)
 	}
 	if caCertPEM == "" {
 		return fmt.Errorf("%s issuer %q: GetCACertPEM returned empty PEM with no error; "+
 			"choose an issuer type that exposes a static CA chain", protocol, issuerID)
 	}
 	return nil
 }
 // buildFinalHandler builds the outer HTTP dispatch handler that routes incoming
 // requests to either the authenticated apiHandler chain or the unauthenticated
 // noAuthHandler chain based on URL path prefix. Extracted from main() so the
 // dispatch logic can be unit tested without booting the full server stack
 // (see cmd/server/finalhandler_test.go).
 //
 // Dispatch rules (M-001, audit 2026-04-19, option D):
 //
 //   - /health, /ready, /api/v1/auth/info           → no-auth (probes + login detection)
 //   - /api/v1/version                              → no-auth (U-3 ride-along: build identity for rollout/probes)
 //   - /.well-known/pki/*                           → no-auth (RFC 5280 CRL, RFC 6960 OCSP)
 //   - /.well-known/est/*                           → no-auth (RFC 7030 §3.2.3)
 //   - /scep, /scep/*                               → no-auth (RFC 8894 §3.2, CSR challengePassword)
 //   - /scep-mtls, /scep-mtls/*                     → no-auth (client cert + challengePassword at the handler layer)
 //   - /api/v1/*                                    → auth (Bearer token required)
 //   - /assets/*                                    → static file server (dashboard only)
 //   - anything else                                → SPA index.html fallback (dashboard only)
 //     OR apiHandler (no dashboard)
 //
 // EST/SCEP clients (IoT devices, 802.1X supplicants, MDM endpoints, network
 // appliances) cannot present certctl Bearer tokens, so those endpoints must be
 // reachable without the Auth middleware. Authentication is instead enforced by
 // CSR signature verification, profile policy gates, and for SCEP the
 // challengePassword shared secret (fail-loud gated by preflightSCEPChallengePassword
 // above).
 //
 // webDir must point to a directory containing index.html + assets/ when
 // dashboardEnabled is true; it is ignored otherwise.
 func buildFinalHandler(apiHandler, noAuthHandler http.Handler, webDir string, dashboardEnabled bool) http.Handler {
 	var fileServer http.Handler
 	if dashboardEnabled {
 		fileServer = http.FileServer(http.Dir(webDir))
 	}
 	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 		path := r.URL.Path
 		// Health/ready, auth/info, and version bypass auth middleware.
 		// Health/ready: Docker/K8s health probes don't carry Bearer tokens.
 		// auth/info: React app calls this before login to detect auth mode.
 		// version: U-3 ride-along (cat-u-no_version_endpoint) — rollout
 		// systems and blackbox probes need build identity without a key.
 		if path == "/health" || path == "/ready" || path == "/api/v1/auth/info" || path == "/api/v1/version" {
 			noAuthHandler.ServeHTTP(w, r)
 			return
 		}
 		// RFC 5280 CRL and RFC 6960 OCSP live under /.well-known/pki/ and MUST
 		// be served unauthenticated — relying parties (browsers, OpenSSL, OCSP
 		// stapling sidecars, mTLS clients) cannot present certctl Bearer tokens.
 		if strings.HasPrefix(path, "/.well-known/pki") {
 			noAuthHandler.ServeHTTP(w, r)
 			return
 		}
 		// RFC 7030 EST endpoints ride the no-auth middleware chain (M-001,
 		// option D, audit 2026-04-19). Trust boundary is CSR signature +
 		// (per EST hardening Phase 2) optional client cert at the handler
 		// layer, not HTTP Bearer. /.well-known/est/cacerts is explicitly
 		// anonymous per RFC 7030 §4.1.1; /.well-known/est-mtls/<PathID>/
 		// (EST hardening Phase 2 sibling route) requires a client cert
 		// gate at the handler layer — both share this prefix gate because
 		// "/.well-known/est-mtls" is itself prefixed by "/.well-known/est".
 		// EST hardening Phase 3's HTTP Basic enrollment-password is a
 		// per-profile handler-layer auth that runs INSIDE the no-auth
 		// middleware chain (since the chain skips the Bearer middleware,
 		// the handler gets to define its own auth contract).
 		if strings.HasPrefix(path, "/.well-known/est") {
 			noAuthHandler.ServeHTTP(w, r)
 			return
 		}
 		// RFC 8894 SCEP rides the no-auth chain (M-001, option D). SCEP clients
 		// authenticate via the challengePassword attribute in the PKCS#10 CSR,
 		// not via HTTP Bearer tokens. preflightSCEPChallengePassword refuses to
 		// start the server if SCEP is enabled without a non-empty shared secret.
 		//
 		// SCEP RFC 8894 + Intune master bundle Phase 6.5: the sibling
 		// /scep-mtls[/<pathID>] route also rides the no-auth chain. Its
 		// auth boundary is (a) client cert verified at the TLS layer +
 		// re-verified per-profile at the handler layer, plus (b) the
 		// challenge password; neither is a Bearer token. On /scepxyz vs
 		// /scep-mtls: the HasPrefix(path, "/scep/") gate requires the
 		// trailing slash, so it matches neither "/scepxyz" nor
 		// "/scep-mtls"; the latter gets its own dedicated gate below.
 		if path == "/scep" || strings.HasPrefix(path, "/scep/") {
 			noAuthHandler.ServeHTTP(w, r)
 			return
 		}
 		if path == "/scep-mtls" || strings.HasPrefix(path, "/scep-mtls/") {
 			noAuthHandler.ServeHTTP(w, r)
 			return
 		}
 		// Authenticated API routes — full middleware stack including Auth.
 		if strings.HasPrefix(path, "/api/v1/") {
 			apiHandler.ServeHTTP(w, r)
 			return
 		}
 		if !dashboardEnabled {
 			// No dashboard: everything non-special falls through to the
 			// authenticated handler (preserves pre-M-001 behavior for API-only
 			// deployments).
 			apiHandler.ServeHTTP(w, r)
 			return
 		}
 		// Dashboard-present: serve static assets directly, SPA fallback for
 		// everything else.
 		if strings.HasPrefix(path, "/assets/") {
 			fileServer.ServeHTTP(w, r)
 			return
 		}
 		http.ServeFile(w, r, webDir+"/index.html")
 	})
 }
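
The no-auth half of the dispatch rules can be sketched as a pure predicate, which makes the prefix edge cases easy to check. This is an illustration, not the handler itself: `routesToNoAuth` is a hypothetical name, and the `corp` path ID and `/api/v1/certificates` route in the demo are made-up inputs.

```go
package main

import (
	"fmt"
	"strings"
)

// routesToNoAuth mirrors only the no-auth branches of the dispatch rules:
// exact-match probe paths first, then the prefix gates.
func routesToNoAuth(path string) bool {
	switch path {
	case "/health", "/ready", "/api/v1/auth/info", "/api/v1/version", "/scep", "/scep-mtls":
		return true
	}
	for _, prefix := range []string{"/.well-known/pki", "/.well-known/est", "/scep/", "/scep-mtls/"} {
		if strings.HasPrefix(path, prefix) {
			return true
		}
	}
	return false
}

func main() {
	paths := []string{
		"/scep",
		"/scepxyz",
		"/scep-mtls/corp",
		"/.well-known/est-mtls/corp/simpleenroll",
		"/api/v1/certificates",
	}
	for _, p := range paths {
		fmt.Printf("%-42s noAuth=%v\n", p, routesToNoAuth(p))
	}
}
```

Note that `/.well-known/est-mtls/...` is caught by the `/.well-known/est` prefix, while `/scepxyz` is not caught by `/scep/` because that gate includes the trailing slash.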
 // authPermissionCheckerAdapter bridges the typed-string Authorizer
 // signature (authsvc.Authorizer.CheckPermission takes
 // authdomain.ActorTypeValue + authdomain.ScopeType) to the plain-string
 // auth.PermissionChecker interface used by the auth.RequirePermission
 // middleware factory. Lives in cmd/server so internal/auth doesn't have
 // to import internal/service/auth + internal/domain/auth (would create
 // a cycle).
 type authPermissionCheckerAdapter struct {
 	a *authsvc.Authorizer
 }
 func (ad authPermissionCheckerAdapter) CheckPermission(
 	ctx context.Context,
 	actorID string,
 	actorType string,
 	tenantID string,
 	permission string,
 	scopeType string,
 	scopeID *string,
 ) (bool, error) {
 	return ad.a.CheckPermission(
 		ctx,
 		actorID,
 		authdomainAlias.ActorTypeValue(actorType),
 		tenantID,
 		permission,
 		authdomainAlias.ScopeType(scopeType),
 		scopeID,
 	)
 }
 // authCheckResolverAdapter bridges the postgres ActorRoleRepository
 // (authdomain.ActorTypeValue) to handler.AuthCheckResolver
 // (domain.ActorType). Lives in cmd/server so the handler layer keeps its
 // existing import set; the GUI's /v1/auth/check probe round-trips
 // through this on every page load. Read-only — no caller / no audit row.
 //
 // Bundle 1 Phase 3 closure (M1): the equivalent surface area on
 // /v1/auth/me runs through the service layer's auth.role.list permission
 // gate, which the GUI may not yet hold during initial render. AuthCheck
 // has no permission gate (its only requirement is "the request
 // authenticated"), so the bypass is by design.
 type authCheckResolverAdapter struct {
 	repo *postgres.ActorRoleRepository
 }
 func (ad authCheckResolverAdapter) ListRoles(
 	ctx context.Context,
 	actorID string,
 	actorType domain.ActorType,
 	tenantID string,
 ) ([]*authdomainAlias.ActorRole, error) {
 	return ad.repo.ListByActor(ctx, actorID, authdomainAlias.ActorTypeValue(actorType), tenantID)
 }
 func (ad authCheckResolverAdapter) EffectivePermissions(
 	ctx context.Context,
 	actorID string,
 	actorType domain.ActorType,
 	tenantID string,
 ) ([]repository.EffectivePermission, error) {
 	return ad.repo.EffectivePermissions(ctx, actorID, authdomainAlias.ActorTypeValue(actorType), tenantID)
 }
 // =============================================================================
 // sessionMinterAdapter — bridge from *session.Service to oidcsvc.SessionMinter.
 //
 // The OIDC service's SessionMinter port (Phase 3) takes a *userdomain.User
 // + role IDs and returns (cookie, csrf, err). The session.Service's
 // Create method takes (actorID, actorType, ip, ua) -> *CreateResult.
 // This adapter unwraps the User into actorID/actorType + reshapes the
 // return tuple. Lives in cmd/server so the session package doesn't have
 // to know about user.User and the user package doesn't have to know
 // about session.CreateResult.
 // =============================================================================
 type sessionMinterAdapter struct {
 	svc *session.Service
 }
 func (a *sessionMinterAdapter) MintForUser(
 	ctx context.Context,
 	user *userdomain.User,
 	_ []string, // roleIDs unused at the session-mint layer; the rbac middleware looks them up at request time
 	ip, userAgent string,
 ) (cookieValue, csrfToken string, err error) {
 	if user == nil {
 		return "", "", fmt.Errorf("session mint: user is nil")
 	}
 	res, err := a.svc.Create(ctx, user.ID, string(domain.ActorTypeUser), ip, userAgent)
 	if err != nil {
 		return "", "", err
 	}
 	return res.CookieValue, res.CSRFToken, nil
 }
 // silenceUnusedImports keeps the new oidcsvc + oidcdomain imports load-
 // bearing in case any file shuffles. Linker dead-code elimination handles
 // the runtime cost.
 var (
 	_ = oidcdomain.OIDCProvider{}
 )
 // =============================================================================
 // breakglassSessionMinterAdapter — bridge from *session.Service to
 // breakglass.SessionMinter.
 //
 // The break-glass service's SessionMinter port (Phase 7.5) returns
 // (cookie, csrf, err); the underlying *session.Service.Create returns
 // *CreateResult. This adapter unwraps the result. Lives in cmd/server
 // so the breakglass package doesn't have to know about session.Service.
 // =============================================================================
 type breakglassSessionMinterAdapter struct {
 	svc *session.Service
 }
 func (a breakglassSessionMinterAdapter) Create(ctx context.Context, actorID, actorType, ip, userAgent string) (string, string, error) {
 	res, err := a.svc.Create(ctx, actorID, actorType, ip, userAgent)
 	if err != nil {
 		return "", "", err
 	}
 	return res.CookieValue, res.CSRFToken, nil
 }
 // RevokeAllForActor — Audit 2026-05-10 HIGH-1 wire. After a break-glass
 // password rotation or credential removal, every active session for the
 // target actor must be revoked so a phished-then-rotated credential
 // doesn't leave the attacker's session live.
 func (a breakglassSessionMinterAdapter) RevokeAllForActor(ctx context.Context, actorID, actorType string) error {
 	return a.svc.RevokeAllForActor(ctx, actorID, actorType)
 }
 // oidcProvidersListAdapter bridges the postgres OIDCProviderRepository
 // to handler.OIDCProvidersListResolver. The handler returns
 // []*OIDCProviderInfo (id + display_name + login_url) for the public-
 // safe GUI Login-page payload; the repo returns the full OIDCProvider
 // row. The adapter projects + maps the login_url shape that
 // /auth/oidc/login?provider=<id> expects. Auth Bundle 2 Phase 6 /
 // Category E.
 type oidcProvidersListAdapter struct {
 	repo repository.OIDCProviderRepository
 }
 func (a oidcProvidersListAdapter) List(ctx context.Context, tenantID string) ([]*handler.OIDCProviderInfo, error) {
 	provs, err := a.repo.List(ctx, tenantID)
 	if err != nil {
 		return nil, err
 	}
 	out := make([]*handler.OIDCProviderInfo, 0, len(provs))
 	for _, p := range provs {
 		// Audit 2026-05-10 MED-9 closure — filter disabled providers
 		// at the adapter so the LoginPage's "Sign in with X" buttons
 		// don't render for offline IdPs. The HandleAuthRequest
 		// service-layer ErrProviderDisabled check is the
 		// defense-in-depth guard for direct API / MCP / CLI callers.
 		if !p.Enabled {
 			continue
 		}
 		out = append(out, &handler.OIDCProviderInfo{
 			ID:          p.ID,
 			DisplayName: p.Name,
 			LoginURL:    "/auth/oidc/login?provider=" + p.ID,
 		})
 	}
 	return out, nil
 }
@@ -417,11 +417,15 @@ Every `CERTCTL_*` environment variable is read by the server's `internal/config/
 | `CERTCTL_CORS_ORIGINS` | (empty) | Allowed CORS origins, comma-separated. Empty = deny all cross-origin |
 | `CERTCTL_RATE_LIMIT_RPS` | `10` | Requests per second per client |
 | `CERTCTL_RATE_LIMIT_BURST` | `20` | Burst allowance above RPS |
-| `CERTCTL_AGENT_BOOTSTRAP_TOKEN` | (empty) | Agent-registration bootstrap secret. Empty = v2.1.x warn-mode pass-through. Set to a real value (`openssl rand -base64 32`); the deny-empty flag's default flip in v2.2.0 will require it. |
+| `CERTCTL_RATE_LIMIT_BUCKET_TTL` | `1h` | Sprint 2 SEC-006: lifetime of an unused token-bucket entry. A background sweeper running every `BucketTTL/4` reclaims buckets whose last `allow()` call is older than this. Values < 1m clamp up to 1m. Lower when facing high-cardinality unauthenticated traffic (CGNAT churn, scanners) where the bucket-map RSS becomes a concern. |
-| `CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY` | `false` | Phase 2 SEC-H1 staged flag. When `true`, the server refuses to start unless `CERTCTL_AGENT_BOOTSTRAP_TOKEN` is non-empty. Default flip to `true` scheduled for v2.2.0. |
+| `CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT` | `1000` | Sprint 2 SCALE-001: cap on the number of Pending rows a single scheduler tick may claim via `ClaimPendingJobs`. Pre-Sprint-2 the scheduler claimed every Pending row in one transaction, which page-thrashed on 100K-job bursts. Values ≤ 0 fail-safe to `1000` (legacy unlimited semantics are no longer reachable). Pair-tune with `CERTCTL_RENEWAL_CONCURRENCY` (default 25) — the default 40:1 ratio keeps the fan-out busy without exhausting upstream-CA rate limits. |
 | `CERTCTL_AGENT_BOOTSTRAP_TOKEN` | (empty; required) | Agent-registration bootstrap secret. Set to a real value (`openssl rand -base64 32`). Sprint 5 ACQ RED-003 (2026-05-16) flipped the paired `_DENY_EMPTY` flag's default to `true`, so leaving this empty now refuses server start (unless `CERTCTL_DEMO_MODE_ACK=true`). Operators upgrading from v2.1.x who need the warn-mode escape hatch for one upgrade window can set `CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY=false` explicitly. |
 | `CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY` | `true` | Phase 2 SEC-H1 fail-closed guard. When `true` (default since Sprint 5 ACQ RED-003 closure, 2026-05-16), the server refuses to start unless `CERTCTL_AGENT_BOOTSTRAP_TOKEN` is non-empty. Set to `false` only for a v2.1.x→v2.2.x upgrade-window warn-mode escape hatch. |
 | `CERTCTL_DEMO_MODE_ACK` | `false` | Acknowledges demo-mode synthetic admin posture (required when `CERTCTL_AUTH_TYPE=none` binds to a non-loopback host). Must be paired with `CERTCTL_DEMO_MODE_ACK_TS` per Phase 2 SEC-H3. |
 | `CERTCTL_DEMO_MODE_ACK_TS` | (empty) | Phase 2 SEC-H3: unix-epoch timestamp at which DemoModeAck was last acknowledged. When `CERTCTL_DEMO_MODE_ACK=true`, this must parse as a unix epoch within the last 24h. Set via `CERTCTL_DEMO_MODE_ACK_TS=$(date +%s)` at every `docker compose up`. |
 | `CERTCTL_ACME_INSECURE_ACK` | `false` | Phase 2 SEC-M4: explicit ACK required to boot with `CERTCTL_ACME_INSECURE=true`. Production deploys MUST never set either flag. |
 | `CERTCTL_DATABASE_MAX_CONNS` | `50` | Phase 6 SCALE-M1: max open DB connections in the server's pool. Default was `25` pre-Phase-6. Idle connections = max/5. Operator-tune ladder for larger fleets: ≤500 certs → 50; 5K certs → 100; 50K certs → 200 (also raise Postgres `max_connections`). See `docs/operator/scale.md`. |
 | `CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS` | (unset → 600) | Phase 6 SCALE-M3: process-wide override for the asyncpoll package's `DefaultMaxWait` (10 minutes). Caps total wall-clock time the certctl-server spends polling an async CA (DigiCert / Entrust / GlobalSign / Sectigo) before returning `StillPending` to the scheduler for re-enqueue. Per-connector overrides (`CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS`, etc.) take precedence when set. |
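
The interaction between `CERTCTL_AGENT_BOOTSTRAP_TOKEN`, `CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY`, and `CERTCTL_DEMO_MODE_ACK` described in the rows above can be summarized as a small decision function. This is a sketch of the documented behavior only; `checkBootstrapToken` is a hypothetical name and the real guard in `internal/config` may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
)

// checkBootstrapToken returns nil when the server may start. Startup is
// allowed if the token is set, demo mode is acknowledged, or the
// deny-empty flag has been explicitly turned off (warn mode).
func checkBootstrapToken(token string, denyEmpty, demoModeAck bool) error {
	if token != "" || demoModeAck || !denyEmpty {
		return nil
	}
	return errors.New("CERTCTL_AGENT_BOOTSTRAP_TOKEN is empty and " +
		"CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY=true: refuse to start")
}

func main() {
	// Defaults after the flip: empty token, denyEmpty=true, no demo ack.
	fmt.Println(checkBootstrapToken("", true, false))
	// A non-empty token satisfies the guard regardless of the other flags.
	fmt.Println(checkBootstrapToken("placeholder-token", true, false))
}
```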
 ### Agent
@@ -116,8 +116,11 @@ services:
    networks:
      certctl-test:
        ipv4_address: 10.30.50.2
    # Acquisition-audit SEC-014 closure (Sprint 2, 2026-05-16).
    # Loopback-only host-port bind — the integration-test runner on
    # the host needs reachability, no other interface does.
    ports:
-      - "5432:5432"
+      - "127.0.0.1:5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U certctl -d certctl"]
      interval: 5s
@@ -261,6 +264,18 @@ services:
      CERTCTL_AUTH_TYPE: api-key
      CERTCTL_AUTH_SECRET: test-key-2026
      # Phase 2 SEC-H1 + Sprint 5 RED-003 closure (2026-05-16): the
      # AgentBootstrapTokenDenyEmpty fail-closed guard refuses to start
      # the server when CERTCTL_AGENT_BOOTSTRAP_TOKEN is empty (the
      # default DENY_EMPTY=true flipped on Sprint 5). Demo stacks
      # bypass the guard via CERTCTL_DEMO_MODE_ACK=true, but this is
      # the e2e TEST stack (production-like auth posture), not a demo
      # stack — set a deterministic placeholder token so the server
      # boots and the vendor-edge integration tests can run. Clearly
      # test-only; do NOT copy to production. Operators set this from
      # `openssl rand -base64 32` per docs/operator/security.md.
      CERTCTL_AGENT_BOOTSTRAP_TOKEN: test-agent-bootstrap-token-deterministic-fixture
      # Key generation — agent-side (production-like)
      CERTCTL_KEYGEN_MODE: agent
@@ -62,7 +62,13 @@ services:
  # handshake. ECDSA-P256 with SHA-256 is universally supported. See
  # docs/tls.md Pattern 1.
  certctl-tls-init:
-    image: alpine/openssl:latest
+    # DEPL-002 closure (Sprint 3, 2026-05-16): digest-pin so the
    # production-shaped compose has the same supply-chain posture as
    # the certctl Dockerfiles (which CI guards via digest-validity.sh).
    # The :latest tag floats; the digest is captured at the time
    # this comment was written. Bump after running the digest-
    # validity guard to confirm the new digest is still pullable.
    image: alpine/openssl:latest@sha256:41036db23542ed4cc09bc278d8a7e23b3da01690abb4b0e353b1bb87d70520ed
    container_name: certctl-tls-init
    restart: "no"
    entrypoint: /bin/sh
@@ -123,7 +129,12 @@ services:
  # `unhealthy` flap to cascade into certctl-server's `service_healthy`
  # depends_on, blocking the whole stack.
  postgres:
-    image: postgres:16-alpine
+    # DEPL-002 closure (Sprint 3, 2026-05-16): digest-pin matching the
    # alpine/openssl pin above. The `16-alpine` tag is the stable
    # major-version stream; the digest snapshots today's image so a
    # silent upstream rebuild can't slip into a production deploy
    # mid-rollout. Bump alongside dependency reviews.
    image: postgres:16-alpine@sha256:890480b08124ce7f79960a9bb16fe39729aa302bd384bfd7c408fee6c8f7adb7
    container_name: certctl-postgres
    environment:
      POSTGRES_DB: certctl
@@ -134,8 +145,18 @@ services:
      # default for screenshot/demo use; production deploys never
      # depend on that fallback.
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    # Acquisition-audit SEC-014 closure (Sprint 2, 2026-05-16). Bind
    # the published port to 127.0.0.1 ONLY — the certctl-server
    # connection comes in via the `certctl-network` Docker network
    # (the host-port mapping is operator convenience for psql / DB
    # inspection only). Pre-fix, the "5432:5432" form bound on
    # 0.0.0.0, exposing the Postgres TCP listener on every interface
    # of any host that happened to be on a public IP. The loopback
    # bind keeps host-side psql access working while preventing the
    # cross-network exposure landmine for compose deploys that aren't
    # behind a firewall.
    ports:
-      - "5432:5432"
+      - "127.0.0.1:5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
@@ -72,3 +72,28 @@ IMPORTANT NOTES FOR PRODUCTION:
   - All containers run as non-root
   - Implement network policies to restrict traffic between components
   - Consider pod security policies or security standards for your cluster
 {{- /*
  DEPL-006 closure (Sprint 3, 2026-05-16). Loud notice when the
  operator runs a multi-replica deploy without setting the two
  required HA toggles. Per-pod rate-limit buckets and round-robin
  load balancing both silently break correctness above replicas:1.
 */}}
 {{- if gt (int .Values.server.replicas) 1 }}
 ⚠️  HA MISCONFIGURATION WARNINGS (replicas={{ .Values.server.replicas }}):
 {{- $backend := .Values.server.rateLimiting.backend | default "memory" }}
 {{- if eq $backend "memory" }}
   - server.rateLimiting.backend = "memory" with replicas > 1 gives each
     pod its own bucket map, so the configured cap is effectively
     multiplied by the replica count. Set
     `--set server.rateLimiting.backend=postgres` (see DEPL-006 /
     docs/operator/runbooks/ha.md).
 {{- end }}
 {{- if not .Values.server.service.sessionAffinity }}
   - server.service.sessionAffinity is empty. Round-robin Service load
     balancing routes login → /api/v1/auth/login → /api/v1/auth/csrf
     across different pods, breaking the CSRF token + session cookie
     handshake. Set
     `--set server.service.sessionAffinity=ClientIP`.
 {{- end }}
 {{- end }}
@@ -0,0 +1,178 @@
 {{- /*
 Phase 4 DEPL-H2 closure (2026-05-14): opt-in Helm CronJob for
 PostgreSQL backups.
 OPERATOR OPT-IN. Default `backup.enabled: false`. Turning it on
 requires:
  - In-cluster Postgres (this CronJob does NOT cover managed DB
    services — for AWS RDS / GCP CloudSQL / Azure DB rely on the
    provider's PITR).
  - A sink choice (PVC or S3) configured in values.yaml.
  - For S3: a Secret holding AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY
    (or use a service account with IRSA on EKS).
 The pg_dump invocation matches the canonical shape documented in
 docs/operator/runbooks/postgres-backup.md so a manual run and a
 CronJob run produce equivalent dumps:
  pg_dump --format=custom --no-owner --no-acl --dbname=certctl
 For sink choices beyond PVC + S3 (GCS, Azure Blob, NFS, restic, etc.),
 extend the `aws s3 cp` line below. The Job is intentionally minimal —
 it does ONE thing (capture + ship), not orchestrate retention or
 rotation. Off-host retention is the sink's responsibility (S3 lifecycle
 rules, PVC snapshot retention on the storage class, etc.).
 */ -}}
 {{- if .Values.backup.enabled }}
 apiVersion: batch/v1
 kind: CronJob
 metadata:
  name: {{ include "certctl.fullname" . }}-postgres-backup
  labels:
    {{- include "certctl.labels" . | nindent 4 }}
    app.kubernetes.io/component: postgres-backup
 spec:
  schedule: {{ .Values.backup.schedule | quote }}
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: {{ .Values.backup.successfulJobsHistoryLimit | default 3 }}
  failedJobsHistoryLimit: {{ .Values.backup.failedJobsHistoryLimit | default 1 }}
  startingDeadlineSeconds: {{ .Values.backup.startingDeadlineSeconds | default 300 }}
  jobTemplate:
    spec:
      backoffLimit: {{ .Values.backup.backoffLimit | default 1 }}
      activeDeadlineSeconds: {{ .Values.backup.activeDeadlineSeconds | default 3600 }}
      template:
        metadata:
          labels:
            {{- include "certctl.labels" . | nindent 12 }}
            app.kubernetes.io/component: postgres-backup
        spec:
          restartPolicy: Never
          {{- with .Values.imagePullSecrets }}
          imagePullSecrets:
            {{- toYaml . | nindent 12 }}
          {{- end }}
          serviceAccountName: {{ include "certctl.serviceAccountName" . }}
          securityContext:
            runAsUser: 1000
            runAsGroup: 1000
            runAsNonRoot: true
            fsGroup: 1000
          containers:
            - name: backup
              image: {{ .Values.backup.image | default "postgres:16-alpine" | quote }}
              imagePullPolicy: {{ .Values.backup.imagePullPolicy | default "IfNotPresent" | quote }}
              env:
                - name: PGHOST
                  value: {{ include "certctl.fullname" . }}-postgres
                - name: PGPORT
                  value: {{ .Values.postgresql.service.port | default 5432 | quote }}
                - name: PGUSER
                  valueFrom:
                    secretKeyRef:
                      name: {{ include "certctl.fullname" . }}-postgres
                      key: username
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: {{ include "certctl.fullname" . }}-postgres
                      key: password
                - name: PGDATABASE
                  valueFrom:
                    secretKeyRef:
                      name: {{ include "certctl.fullname" . }}-postgres
                      key: database
                {{- if eq (.Values.backup.sink | default "pvc") "s3" }}
                # S3 sink — operator provides AWS credentials via the
                # Secret referenced in backup.s3.credentialsSecret. The
                # credentials need s3:PutObject + s3:ListBucket on the
                # target bucket only; least-privilege per industry
                # standard.
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: {{ .Values.backup.s3.credentialsSecret.name | quote }}
                      key: {{ .Values.backup.s3.credentialsSecret.accessKeyIdKey | default "AWS_ACCESS_KEY_ID" }}
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: {{ .Values.backup.s3.credentialsSecret.name | quote }}
                      key: {{ .Values.backup.s3.credentialsSecret.secretAccessKeyKey | default "AWS_SECRET_ACCESS_KEY" }}
                {{- with .Values.backup.s3.region }}
                - name: AWS_DEFAULT_REGION
                  value: {{ . | quote }}
                {{- end }}
                {{- end }}
              command:
                - /bin/sh
                - -ceu
                - |
                  # Phase 4 DEPL-H2: canonical pg_dump shape per
                  # docs/operator/runbooks/postgres-backup.md.
                  # Custom-format compressed dump, no ownership /
                  # ACL embedded — produces a portable artifact
                  # restorable into any Postgres ≥ source major
                  # via `pg_restore -d certctl <dump>`.
                  set -euo pipefail
                  TIMESTAMP="$(date -u +%Y%m%dT%H%M%SZ)"
                  DUMP_FILE="/tmp/certctl-${TIMESTAMP}.dump"
                  echo "[backup-cronjob] capturing dump at ${TIMESTAMP}"
                  pg_dump --format=custom --no-owner --no-acl --dbname="${PGDATABASE}" \
                    > "${DUMP_FILE}"
                  # Integrity check — pg_restore --list parses the
                  # dump's table-of-contents; a corrupt dump fails
                  # here without shipping garbage off-host. Same
                  # check the manual runbook performs.
                  echo "[backup-cronjob] verifying dump integrity"
                  pg_restore --list "${DUMP_FILE}" > /dev/null
                  {{- if eq (.Values.backup.sink | default "pvc") "s3" }}
                  # S3 sink — requires aws-cli. The default
                  # postgres:16-alpine image does NOT include
                  # aws-cli; operators MUST set
                  # backup.image to an image that bundles both
                  # (e.g. ghcr.io/your-org/postgres-aws:16) OR
                  # override backup.command to install aws-cli at
                  # runtime. The line below assumes the image has
                  # `aws` on PATH.
                  S3_PATH="{{ .Values.backup.s3.bucket }}/{{ .Values.backup.s3.prefix | default "certctl" }}/certctl-${TIMESTAMP}.dump"
                  echo "[backup-cronjob] uploading to s3://${S3_PATH}"
                  aws s3 cp "${DUMP_FILE}" "s3://${S3_PATH}"
                  rm -f "${DUMP_FILE}"
                  {{- else }}
                  # PVC sink: dump lands at /backups/certctl-${TIMESTAMP}.dump
                  # mounted from backup.pvc.claimName. Retention is the
                  # PVC's responsibility (storage-class snapshot lifecycle
                  # or a separate cleanup CronJob). /tmp and /backups are
                  # different filesystems, so a direct mv would copy and
                  # could leave a partial file behind. Stage the copy
                  # under a .partial name on the PVC, then rename it
                  # (a same-filesystem rename is atomic), so the final
                  # path never holds a partial dump.
                  FINAL_PATH="/backups/certctl-${TIMESTAMP}.dump"
                  echo "[backup-cronjob] persisting to ${FINAL_PATH}"
                  mv "${DUMP_FILE}" "${FINAL_PATH}.partial"
                  mv "${FINAL_PATH}.partial" "${FINAL_PATH}"
                  {{- end }}
                  echo "[backup-cronjob] done"
              {{- if ne (.Values.backup.sink | default "pvc") "s3" }}
              volumeMounts:
                - name: backups
                  mountPath: /backups
              {{- end }}
              resources:
                {{- toYaml (.Values.backup.resources | default dict) | nindent 16 }}
          {{- if ne (.Values.backup.sink | default "pvc") "s3" }}
          volumes:
            - name: backups
              persistentVolumeClaim:
                claimName: {{ .Values.backup.pvc.claimName | quote }}
          {{- end }}
          {{- with .Values.nodeAffinity }}
          affinity:
            nodeAffinity:
              {{- toYaml . | nindent 14 }}
          {{- end }}
          {{- with .Values.backup.tolerations }}
          tolerations:
            {{- toYaml . | nindent 12 }}
          {{- end }}
 {{- end }}
@@ -0,0 +1,89 @@
 {{- /*
 Phase 4 DEPL-M1 closure (2026-05-14): Helm pre-install / pre-upgrade
 hook that runs Postgres migrations before the server Deployment rolls.
 Pre-DEPL-M1, postgres.RunMigrations was invoked at server boot
 (cmd/server/main.go:151) as the only migration path. That works for
 Compose deployments but conflicts with Kubernetes rolling deploys:
 when a new server image lands with a schema change, multiple replicas
 race the migration during the rollout. The hook resolves the race by
 running migrations OUT OF BAND, exactly once, before any new server
 pod starts.
 How it works:
  - The Job ships the same certctl-server image as the Deployment, so
    the migration code path is binary-identical to the boot-time path.
  - It runs `certctl-server --migrate-only` (a flag the cmd/server
    main process must support — see cmd/server/main.go for the flag
    parse + early-exit path).
  - The CERTCTL_MIGRATIONS_VIA_HOOK=true env var is ALSO set on the
    server Deployment (via values.yaml). When the server boots, it
    sees this env var and skips its own RunMigrations call — the
    hook already did the work. Compose deploys don't set the env
    var, so they keep the boot-time path unchanged.
  - hook-delete-policy hook-succeeded means the Job is cleaned up
    automatically on success but retained on failure for operator
    diagnosis; before-hook-creation removes that leftover Job just
    before the next install/upgrade recreates it.
  - The hook-weight ensures the migration Job runs before any other
    pre-install/pre-upgrade resources (the StatefulSet's PVC has to
    exist first; in practice the StatefulSet has no hook so it lands
    naturally in the install phase after the Job completes).
 Operators on Compose: this hook is a no-op for you. The server still
 runs migrations at boot per the existing path.
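 Minimal values sketch for turning the hook on. Key names are the ones
 this template reads; the two numbers simply restate the template's
 own fallbacks and are shown for illustration only:
   migrations:
     viaHook: true
     backoffLimit: 1
     activeDeadlineSeconds: 600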
 */ -}}
 {{- if .Values.migrations.viaHook }}
 apiVersion: batch/v1
 kind: Job
 metadata:
  name: {{ include "certctl.fullname" . }}-migrate
  labels:
    {{- include "certctl.labels" . | nindent 4 }}
    app.kubernetes.io/component: migration
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": hook-succeeded,before-hook-creation
 spec:
  backoffLimit: {{ .Values.migrations.backoffLimit | default 1 }}
  activeDeadlineSeconds: {{ .Values.migrations.activeDeadlineSeconds | default 600 }}
  template:
    metadata:
      labels:
        {{- include "certctl.labels" . | nindent 8 }}
        app.kubernetes.io/component: migration
    spec:
      restartPolicy: Never
      serviceAccountName: {{ include "certctl.serviceAccountName" . }}
      securityContext:
        {{- include "certctl.podSecurityContext" .Values.server.securityContext | nindent 8 }}
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      containers:
        - name: migrate
          image: {{ include "certctl.serverImage" . }}
          imagePullPolicy: {{ .Values.server.image.pullPolicy }}
          # Migration-only entrypoint. The server binary supports a
          # --migrate-only flag that runs postgres.RunMigrations +
          # postgres.RunSeed and exits cleanly (zero on success,
          # non-zero on migration failure). See cmd/server/main.go
          # for the implementation. The flag is hermetic — no HTTP
          # listener starts, no scheduler ticks, no signing
          # operations occur. Pure schema-mutation pass.
          command:
            - /app/server
            - --migrate-only
          env:
            - name: CERTCTL_DATABASE_URL
              value: {{ include "certctl.databaseURL" . | quote }}
            - name: CERTCTL_LOG_LEVEL
              value: {{ .Values.server.logging.level | default "info" | quote }}
            - name: CERTCTL_LOG_FORMAT
              value: {{ .Values.server.logging.format | default "json" | quote }}
          resources:
            {{- toYaml (.Values.migrations.resources | default .Values.server.resources) | nindent 12 }}
          securityContext:
            {{- include "certctl.containerSecurityContext" .Values.server.securityContext | nindent 12 }}
 {{- end }}
@@ -9,6 +9,21 @@ metadata:
 spec:
  serviceName: {{ include "certctl.fullname" . }}-postgres
  replicas: 1
  # Phase 4 DEPL-M4 closure (2026-05-14): explicit StatefulSet update +
  # pod-management strategies. Defaults make Postgres upgrades
  # operator-controlled rather than automatic:
  #   updateStrategy.type: OnDelete — Postgres pods do NOT roll
  #     automatically when the StatefulSet spec changes. Operator
  #     deletes the pod explicitly after taking a backup + reviewing
  #     the change. Prevents an accidental Helm-template tweak from
  #     triggering a database restart at an awkward time.
  #   podManagementPolicy: OrderedReady — when scaling Postgres to
  #     more than one replica (future HA work), pods come up one at a time
  #     and must reach Ready before the next pod is created. Aligns
  #     with the standard Postgres-on-Kubernetes pattern.
  updateStrategy:
    type: OnDelete
  podManagementPolicy: OrderedReady
  selector:
    matchLabels:
      {{- include "certctl.postgresSelectorLabels" . | nindent 6 }}
@@ -0,0 +1,145 @@
 {{- /*
 Phase 4 DEPL-L2 closure (2026-05-14): opt-in Prometheus alerting
 rules (a PrometheusRule object) covering the four operationally
 actionable alerts every certctl deployment wants out of the box.
 OPERATOR OPT-IN. Default `monitoring.prometheusRules.enabled: false`.
 Turning it on requires Prometheus Operator CRDs (PrometheusRule kind)
 to be installed in-cluster. Without them this template renders an
 object Kubernetes will reject — keep the toggle off if you're scraping
 with vanilla Prometheus + a Helm-installed AlertManager rules
 ConfigMap instead.
 Metric names + thresholds verified against the actual
 internal/api/handler/metrics.go exposition path:
  - certctl_certificate_expiring_soon: server-side count of certs with
    ExpiresAt in (now, now + 30d]. The 30-day window is computed in
    internal/service/stats.go::GetDashboardSummary.
  - certctl_agent_online: agents with heartbeat in the last 5 minutes.
    A drop below certctl_agent_total signals offline agents.
  - certctl_job_failed_total + certctl_job_completed_total: cumulative
    counters; ratio gives the failure rate over the rate() window.
  - certctl_issuance_failures_total: cumulative counter of failed
    issuance attempts (renewal failures are issuance failures with a
    specific error_class label).
 Adjust thresholds per fleet — the defaults below are tuned for the
 demo dataset (15 certs / 1 agent) and may need raising for production
 fleets with thousands of certs where a steady rate of expiring certs
 is the normal operating state.
 */ -}}
 {{- if and .Values.monitoring.enabled .Values.monitoring.prometheusRules.enabled }}
 apiVersion: monitoring.coreos.com/v1
 kind: PrometheusRule
 metadata:
  name: {{ include "certctl.fullname" . }}-rules
  labels:
    {{- include "certctl.labels" . | nindent 4 }}
    app.kubernetes.io/component: monitoring
    {{- with .Values.monitoring.prometheusRules.labels }}
    {{- toYaml . | nindent 4 }}
    {{- end }}
 spec:
  groups:
    - name: certctl.alerts
      interval: {{ .Values.monitoring.prometheusRules.interval | default "60s" }}
      rules:
        # ---------------------------------------------------------------
        # Alert: CertctlCertificateExpiringSoon
        # Series: certctl_certificate_expiring_soon
        # The certctl-server counts certs with ExpiresAt in
        # (now, now + 30d] every metrics scrape. Fires whenever any cert
        # crosses into that window — operator must triage or extend
        # automation coverage. Rapid renewal infrastructure should keep
        # this number small in steady state.
        # ---------------------------------------------------------------
        - alert: CertctlCertificateExpiringSoon
          expr: certctl_certificate_expiring_soon > {{ .Values.monitoring.prometheusRules.thresholds.expiringCertificateCount | default 0 }}
          for: {{ .Values.monitoring.prometheusRules.thresholds.expiringCertificateFor | default "5m" }}
          labels:
            severity: warning
            component: certctl
          annotations:
            summary: "certctl: {{`{{ $value }}`}} certificate(s) expiring within 30 days"
            description: >-
              certctl_certificate_expiring_soon has been > {{ .Values.monitoring.prometheusRules.thresholds.expiringCertificateCount | default 0 }}
              for at least {{ .Values.monitoring.prometheusRules.thresholds.expiringCertificateFor | default "5m" }}. Investigate via
              /api/v1/certificates?status=expiring or the dashboard's
              Expiring tab. If renewal automation should have covered
              these, check the renewal scheduler logs for the cert IDs
              + the per-issuer failure rate.
        # ---------------------------------------------------------------
        # Alert: CertctlAgentOffline
        # Series: certctl_agent_total - certctl_agent_online
        # Agents flip from online → offline after 5 minutes without a
        # heartbeat (internal/service/stats.go::GetDashboardSummary).
        # The 1h `for:` window prevents a flapping agent from paging the
        # operator on every transient network blip.
        # ---------------------------------------------------------------
        - alert: CertctlAgentOffline
          expr: (certctl_agent_total - certctl_agent_online) > {{ .Values.monitoring.prometheusRules.thresholds.offlineAgentCount | default 0 }}
          for: {{ .Values.monitoring.prometheusRules.thresholds.offlineAgentFor | default "1h" }}
          labels:
            severity: warning
            component: certctl-agent
          annotations:
            summary: "certctl: {{`{{ $value }}`}} agent(s) offline for >1h"
            description: >-
              One or more certctl-agent instances have been without a
              heartbeat for over an hour. Check the agent logs on the
              affected hosts. If the agent host is intentionally
              decommissioned, retire the agent via the dashboard or
              POST /api/v1/agents/{id}/retire to suppress this alert.
        # ---------------------------------------------------------------
        # Alert: CertctlJobFailureRateHigh
        # Series: certctl_job_failed_total / (certctl_job_failed_total + certctl_job_completed_total)
        # Computes the failure rate over a 15-minute rate() window so
        # short bursts don't fire but a sustained issue does. The 5%
        # threshold is a conservative starter — adjust per fleet's
        # baseline.
        # ---------------------------------------------------------------
        - alert: CertctlJobFailureRateHigh
          expr: >-
            (
              rate(certctl_job_failed_total[15m])
              /
              (rate(certctl_job_failed_total[15m]) + rate(certctl_job_completed_total[15m]))
            ) > {{ .Values.monitoring.prometheusRules.thresholds.jobFailureRate | default 0.05 }}
          for: {{ .Values.monitoring.prometheusRules.thresholds.jobFailureRateFor | default "15m" }}
          labels:
            severity: warning
            component: certctl
          annotations:
            summary: "certctl: job failure rate above 5% over 15m"
            description: >-
              The 15m rate of certctl_job_failed_total / total jobs
              has been above 5% for 15+ minutes. Open
              /api/v1/jobs?status=failed to see the failing job IDs
              and root-cause the recurring error class.
        # ---------------------------------------------------------------
        # Alert: CertctlIssuanceFailures
        # Series: certctl_issuance_failures_total
        # Any non-zero rate of issuance failures over a 15m window is
        # operationally significant — a single CA outage or expired
        # ACME account can cascade across the fleet.
        # ---------------------------------------------------------------
        - alert: CertctlIssuanceFailures
          expr: rate(certctl_issuance_failures_total[15m]) > {{ .Values.monitoring.prometheusRules.thresholds.issuanceFailureRate | default 0 }}
          for: {{ .Values.monitoring.prometheusRules.thresholds.issuanceFailureFor | default "15m" }}
          labels:
            severity: warning
            component: certctl
          annotations:
            summary: "certctl: certificate issuance / renewal failures over 15m"
            description: >-
              certctl_issuance_failures_total has been incrementing
              over the last 15 minutes. Check the per-issuer breakdown
              via /api/v1/issuers + the failed-job log in
              /api/v1/jobs?status=failed. Common causes: CA
              outage, ACME account rate-limit, EAB credential
              expiration, stepca provisioner key rotation without
              certctl-side update.
 {{- end }}
@@ -12,6 +12,8 @@ data:
  keygen-mode: {{ .Values.server.keygen.mode | quote }}
  rate-limit-rps: {{ .Values.server.rateLimiting.rps | quote }}
  rate-limit-burst: {{ .Values.server.rateLimiting.burst | quote }}
  rate-limit-backend: {{ .Values.server.rateLimiting.backend | default "memory" | quote }}
  rate-limit-janitor-interval: {{ .Values.server.rateLimiting.janitorInterval | default "5m" | quote }}
  {{- if .Values.server.cors.origins }}
  cors-origins: {{ .Values.server.cors.origins | quote }}
  {{- end }}
@@ -51,6 +51,20 @@ spec:
              containerPort: {{ .Values.server.port }}
              protocol: TCP
          env:
            # DEPL-003 closure (Sprint 3, 2026-05-16). Pre-fix the
            # CERTCTL_MIGRATIONS_VIA_HOOK env var was documented in
            # values.yaml (L797-810) and migration-job.yaml comments
            # but was never rendered into the server Deployment env
            # block. With migrations.viaHook=true the operator's
            # intent is "the pre-install/pre-upgrade Helm Job owns
            # migrations" — but the server pods, missing the env,
            # ran their own boot-time RunMigrations alongside the
            # hook Job, racing on the schema lock. cmd/server/migrations.go
            # only short-circuits when this env is "true" (line 144).
            {{- if .Values.migrations.viaHook }}
            - name: CERTCTL_MIGRATIONS_VIA_HOOK
              value: "true"
            {{- end }}
            - name: CERTCTL_SERVER_HOST
              value: "0.0.0.0"
            - name: CERTCTL_SERVER_PORT
@@ -108,6 +122,19 @@ spec:
                configMapKeyRef:
                  name: {{ include "certctl.fullname" . }}-server
                  key: rate-limit-burst
            # Phase 13 Sprint 13.3 (ARCH-M1) — cross-replica-consistent
            # sliding-window rate limiter. Default memory; flip to
            # postgres when server.replicas > 1.
            - name: CERTCTL_RATE_LIMIT_BACKEND
              valueFrom:
                configMapKeyRef:
                  name: {{ include "certctl.fullname" . }}-server
                  key: rate-limit-backend
            - name: CERTCTL_RATE_LIMIT_JANITOR_INTERVAL
              valueFrom:
                configMapKeyRef:
                  name: {{ include "certctl.fullname" . }}-server
                  key: rate-limit-janitor-interval
            {{- if .Values.server.cors.origins }}
            - name: CERTCTL_CORS_ORIGINS
              valueFrom:
@@ -11,6 +11,23 @@ metadata:
  {{- end }}
 spec:
  type: {{ .Values.server.service.type }}
  {{- /*
    DEPL-006 closure (Sprint 3, 2026-05-16). Render the optional
    sessionAffinity field. docs/operator/runbooks/ha.md instructs
    operators to set sessionAffinity: ClientIP for replicas > 1 so
    login + CSRF flows stay on the same pod; pre-fix the chart did
    not actually pass the value through. sessionAffinityConfig
    clientIP.timeoutSeconds renders only when set, otherwise
    Kubernetes applies its default (10800s / 3h).
  */}}
  {{- if .Values.server.service.sessionAffinity }}
  sessionAffinity: {{ .Values.server.service.sessionAffinity }}
  {{- with .Values.server.service.sessionAffinityTimeoutSeconds }}
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: {{ . }}
  {{- end }}
  {{- end }}
  ports:
    - port: {{ .Values.server.service.port }}
      targetPort: https
@@ -42,15 +42,33 @@ spec:
      interval: {{ .Values.monitoring.serviceMonitor.interval | default "30s" }}
      scrapeTimeout: {{ .Values.monitoring.serviceMonitor.scrapeTimeout | default "10s" }}
      tlsConfig:
-        # The certctl server uses self-signed bootstrap TLS or operator-
-        # provided cert-manager TLS — the ServiceMonitor consumes the
-        # same CA bundle the server presents. When server.tls.existingSecret
-        # is set, operators usually want to pull the matching ca.crt key
-        # out of that Secret. Adjust if your CA chain lives elsewhere.
+        {{- /*
+        Acquisition-audit DEPL-004 closure (Sprint 6 ACQ, 2026-05-16).
+        Pre-Sprint-6 the default was an implicit insecureSkipVerify
+        true via the template falling through the else branch.
+        Post-Sprint-6 values.yaml ships a real-verify default
        (caFile + serverName matching the chart existingSecret /
        cert-manager-emitted Secret at /etc/prometheus/secrets/
        certctl-ca/), so the truthy if-branch below always fires for
        the default install. Operators who want skipVerify back must
        override with tlsConfig insecureSkipVerify true explicitly.
        Operators who blank tlsConfig entirely hit the else-branch
        below and trip the Helm fail directive at chart-render time;
        there is no way to inherit the pre-Sprint-6 implicit-skip
        behavior silently. See docs/operator/helm-deployment.md for
        the narrative explanation, including the lesson that comment
        text referencing Helm template-action delimiters must live
        in Helm-style comment blocks (this block), never in YAML
        hash-comment blocks — the Helm lexer scans for action
        delimiters everywhere in the source text, ignoring YAML
        comment markers, so descriptive references to actions inside
        YAML hash-comments are reinterpreted as template actions
        and abort the entire chart render.
        */ -}}
        {{- if .Values.monitoring.serviceMonitor.tlsConfig }}
        {{- toYaml .Values.monitoring.serviceMonitor.tlsConfig | nindent 8 }}
        {{- else }}
-        insecureSkipVerify: true
+        {{- fail "monitoring.serviceMonitor.tlsConfig was explicitly blanked but monitoring.serviceMonitor.enabled=true (Sprint 6 ACQ DEPL-004 closure, 2026-05-16). The values.yaml default ships caFile=/etc/prometheus/secrets/certctl-ca/ca.crt + serverName=certctl-server which matches the existingSecret mount pattern. If your Prometheus pod mounts the CA bundle at a different path, override caFile rather than blanking the block. If you genuinely need skipVerify, set tlsConfig insecureSkipVerify=true explicitly — never blank. See docs/operator/helm-deployment.md for the upgrade-path note." }}
        {{- end }}
      {{- with .Values.monitoring.serviceMonitor.bearerTokenSecret }}
      bearerTokenSecret:
@@ -31,6 +31,36 @@ server:
  port: 8443
  # Resource requests and limits
  #
  # Phase 4 DEPL-M5 (2026-05-14): per-fleet-size tuning ladder. The
  # default values below are validated against the demo dataset
  # (15 certs / 1 agent) and the baselines in
  # docs/operator/performance-baselines.md (single endpoint < 5s for
  # 100 sequential requests = ~50ms p50; cursor-paginated 1000-cert
  # inventory walk < 3s; renewal scan for 15 certs < 100ms).
  #
  # Larger fleet recommendations (TBD pending Phase 8 load-test runs;
  # operators tune empirically until then — capture readings in your
  # own loadtest-baselines log):
  #
  #   ≤ 500 certs / 100 agents:      defaults below                  (100m / 128Mi req, 500m / 512Mi lim)
  #   5K certs / 1K agents:          tune up — TBD Phase 8           (suggested starter: 500m / 512Mi req, 2000m / 2Gi lim)
  #   50K certs / 10K agents:        tune up — TBD Phase 8           (suggested starter: 2000m / 2Gi req, 4000m / 4Gi lim)
  #
  # The "suggested starter" values above are operator-tuning starting
  # points, NOT validated. Phase 8 (load test coverage expansion) will
  # measure them against synthetic fleets and replace the suggestions
  # with measured ceilings. Until then, treat them as a "raise CPU
  # before raising memory; raise both before scaling out" mental
  # model. Per docs/operator/performance-baselines.md, certctl-server
  # is CPU-bound on issuance / renewal scan work and memory-bound on
  # the inventory query path.
  #
  # Database scale (postgresql.* below) tracks server scale roughly
  # 1:1 — at 50K certs the Postgres instance needs 4 CPU / 4Gi RAM
  # and shared_buffers ≥ 1Gi. Postgres tuning is out of scope for
  # this comment; see docs/operator/runbooks/postgres-backup.md
  # for the production-tuning entry-point.
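  #
  # The 5K-cert "suggested starter" row written out as an override,
  # for illustration only (still unvalidated, per the note above):
  #
  #   server:
  #     resources:
  #       requests:
  #         cpu: 500m
  #         memory: 512Mi
  #       limits:
  #         cpu: 2000m
  #         memory: 2Gi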
  resources:
    requests:
      cpu: 100m
@@ -130,6 +160,17 @@ server:
    type: ClusterIP
    port: 8443
    annotations: {}
    # DEPL-006 closure (Sprint 3, 2026-05-16). Optional sticky-session
    # routing. REQUIRED when server.replicas > 1 so login + CSRF token
    # rows stay on the same pod for the duration of a session — the
    # default round-robin load balancing breaks those flows. Set to
    # "ClientIP" for production HA (see deploy/helm/examples/values-prod-ha.yaml).
    # Leave empty for single-replica deploys.
    sessionAffinity: ""
    # Timeout (in seconds) for which the Service keeps a source IP
    # mapped to the same pod; applies only when sessionAffinity is
    # set. Default null →
    # Kubernetes applies its built-in default (10800s / 3h).
    sessionAffinityTimeoutSeconds: null
  # Authentication configuration.
  # Valid types: "api-key" (production) or "none" (demo only — disables
@@ -181,8 +222,25 @@ server:
  # Rate limiting configuration
  rateLimiting:
-    rps: 100      # Requests per second
+    rps: 100      # Requests per second (token-bucket middleware)
-    burst: 200    # Burst capacity
+    burst: 200    # Burst capacity (token-bucket middleware)
    # Sliding-window-log rate-limit backend (Phase 13 Sprint 13.2/13.3
    # ARCH-M1 closure). Selects the implementation backing the
    # break-glass / OCSP / cert-export / EST limiters. See
    # docs/operator/observability.md for the operator decision tree.
    #
    #   memory   — per-process (default; single-replica deploys).
    #   postgres — cross-replica-consistent via rate_limit_buckets.
    #              REQUIRED when server.replicas > 1 for accurate
    #              cluster-wide enforcement.
    backend: memory
    # Scheduler janitor interval for the postgres backend's
    # rate_limit_buckets sweep. Ignored when backend=memory (the
    # in-memory backend self-prunes on every Allow call).
    # Default 5m; minimum 1m.
    janitorInterval: "5m"
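    #
    # Sketch of a multi-replica override combining the two settings
    # this file marks as REQUIRED when server.replicas > 1. The
    # replica count is an arbitrary example; the chart's own HA
    # example is deploy/helm/examples/values-prod-ha.yaml.
    #
    #   server:
    #     replicas: 3
    #     service:
    #       sessionAffinity: ClientIP
    #     rateLimiting:
    #       backend: postgres
    #       janitorInterval: "5m"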
  # Network scanning configuration
  networkScan:
@@ -449,6 +507,27 @@ agent:
  replicas: 1
  # Resource requests and limits
  #
  # Phase 4 DEPL-M5 (2026-05-14): per-fleet-size tuning ladder for the
  # agent. Defaults are sized for the standard "one cert per host"
  # operating pattern: the agent polls the server every 30 seconds
  # (hardcoded in cmd/agent/main.go::pollInterval — not yet
  # env-configurable), generates ECDSA P-256 keys locally on
  # issuance/renewal events, and is otherwise idle. CPU is bursty only
  # during keygen + CSR submission.
  #
  # Tuning ladder (TBD pending Phase 8 — measure on your fleet):
  #
  #   1 cert / host (typical):        defaults below            (50m / 64Mi req, 200m / 256Mi lim)
  #   10 certs / host:                stays at defaults — agent is poll-driven, not work-bound by cert count
  #   100 certs / host (rare):        raise lim to 500m / 512Mi if you see throttling on issuance bursts
  #
  # The agent does NOT cache certs in memory — issuance is one-shot
  # generate-then-deploy. So per-host memory scales with whatever
  # truststore PEM bundles the agent's connectors load (Apache /
  # Postfix / similar), not with the cert count. Defaults are
  # appropriate for any "agent terminates ≤ 100 certs on this host"
  # deployment.
  resources:
    requests:
      cpu: 50m
@@ -601,17 +680,182 @@ monitoring:
    #     name: certctl-prometheus-key
    #     key: api-key
    # bearerTokenSecret: {}
-    # TLS config for the scrape endpoint. The certctl server presents
-    # the same TLS cert the rest of the chart uses; insecureSkipVerify
-    # defaults to true so demos work out of the box. Production deploys
-    # should pin the CA via caFile or ca.secret.
+    # TLS config for the scrape endpoint. Acquisition-audit DEPL-004
+    # closure (Sprint 6 ACQ, 2026-05-16): pre-Sprint-6 the default was
+    # an implicit `insecureSkipVerify: true` (fell through the
+    # template's else-branch). Post-Sprint-6 the default is a real
    # verify against the chart's CA at the canonical mount path the
    # existingSecret pattern produces (Prometheus mounts the
    # certctl-ca Secret as a volume at /etc/prometheus/secrets/
    # certctl-ca/). Operators whose Prometheus pod mounts the bundle
    # at a different path override `caFile` below; operators who
    # genuinely want skipVerify back can do so explicitly. Operators
    # who blank tlsConfig entirely (`tlsConfig: null` or
    # `tlsConfig: {}`) trip the `{{ fail }}` guard in
    # templates/servicemonitor.yaml at chart-render time — there is
    # no way to inherit the pre-Sprint-6 implicit-skipVerify behavior
    # silently.
    #
    # Production default (verify against the chart's CA):
    tlsConfig:
      caFile: /etc/prometheus/secrets/certctl-ca/ca.crt
      serverName: certctl-server
    #
    # Operator override — different CA mount path:
    # tlsConfig:
-    #   caFile: /etc/prometheus/secrets/certctl-ca/ca.crt
+    #   caFile: /path/to/your/ca.crt
-    #   serverName: certctl-server
+    #   serverName: your-cert-CN
-    # tlsConfig: {}
+    #
    # Operator override — demo / dev-cluster escape hatch
    # (operator-acknowledged unsafe):
    # tlsConfig:
    #   insecureSkipVerify: true
    # Optional relabeling for the scrape job.
    # relabelings: []
  # ----------------------------------------------------------------------
  # Phase 4 DEPL-L2 closure (2026-05-14): PrometheusRule (alert rules)
  #
  # Operator opt-in. Requires Prometheus Operator CRDs (the
  # `monitoring.coreos.com/v1` PrometheusRule kind) installed in
  # cluster. Without those CRDs the rendered object is rejected by
  # `kubectl apply` — keep enabled: false if you scrape with vanilla
  # Prometheus + AlertManager rules ConfigMap instead.
  #
  # Four starter rules ship out of the box (see
  # templates/prometheusrules.yaml for the full PromQL):
  #
  #   CertctlCertificateExpiringSoon — certs expiring within 30d
  #   CertctlAgentOffline             — agent without heartbeat for >1h
  #   CertctlJobFailureRateHigh       — job-failure rate over 5% (15m)
  #   CertctlIssuanceFailures         — any issuance failures in last 15m
  #
  # All thresholds are operator-tunable via the `thresholds:` block
  # below. The defaults are tuned for the demo dataset (15 certs / 1
  # agent); production fleets with sustained renewal volume MAY want
  # to raise the expiringCertificateCount + jobFailureRate thresholds
  # to suppress steady-state noise.
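  #
  # Illustrative override for a larger fleet. The two raised numbers
  # are hypothetical placeholders, not measured recommendations;
  # monitoring.enabled must also be true for the rules to render.
  #
  #   monitoring:
  #     enabled: true
  #     prometheusRules:
  #       enabled: true
  #       thresholds:
  #         expiringCertificateCount: 25
  #         jobFailureRate: 0.10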
  prometheusRules:
    enabled: false
    # Evaluation interval for the rule group.
    interval: 60s
    # Additional labels applied to the PrometheusRule metadata.
    # labels: {}
    # Per-alert threshold / duration tunables.
    thresholds:
      # Fire when more than N certs are in the expiring-soon window.
      expiringCertificateCount: 0
      expiringCertificateFor: 5m
      # Fire when more than N agents are offline (total - online).
      offlineAgentCount: 0
      offlineAgentFor: 1h
      # Fire when job failure rate exceeds this fraction (15m window).
      jobFailureRate: 0.05
      jobFailureRateFor: 15m
      # Fire when issuance failure rate exceeds this value (15m window).
      issuanceFailureRate: 0
      issuanceFailureFor: 15m
 # ==============================================================================
 # Backup CronJob (Phase 4 DEPL-H2 closure, 2026-05-14)
 # ==============================================================================
 # Operator opt-in. Default OFF. The CronJob runs `pg_dump --format=custom
 # --no-owner --no-acl --dbname=certctl` matching the canonical shape
 # documented in docs/operator/runbooks/postgres-backup.md (so manual
 # and automated dumps use identical flags and restore the same way)
 # and ships the result to a sink chosen below.
 #
 # DO NOT enable this for managed Postgres deployments (AWS RDS / GCP
 # Cloud SQL / Azure DB) — those have built-in PITR backup that this
 # CronJob cannot match. For in-cluster Postgres only.
 backup:
  enabled: false
  # Cron expression (UTC). Default: 02:30 UTC daily.
  schedule: "30 2 * * *"
  # Sink: "pvc" (default — dump lands on a PersistentVolumeClaim) or
  # "s3" (uploads via aws-cli — requires an image that bundles
  # aws-cli, see backup.image below).
  sink: pvc
  # Container image. The default postgres:16-alpine has pg_dump but
  # NOT aws-cli; for sink: s3 set this to an image that bundles both
  # (e.g. ghcr.io/your-org/postgres-aws:16) or override the Job's
  # command to install aws-cli at runtime.
  image: postgres:16-alpine
  imagePullPolicy: IfNotPresent
  # PVC sink config — used when sink: pvc.
  pvc:
    # Name of an existing PersistentVolumeClaim mounted at /backups
    # in the Job's pod. The PVC's storage class controls durability
    # and snapshot retention. Operator creates this PVC out of band
    # via their own storage policy.
    claimName: certctl-backups
  # S3 sink config — used when sink: s3.
  s3:
    # Target bucket (without s3:// prefix).
    bucket: ""
    # Object key prefix inside the bucket. Dumps land at
    # s3://<bucket>/<prefix>/certctl-<TIMESTAMP>.dump.
    prefix: certctl
    # AWS region (sets AWS_DEFAULT_REGION). Optional if the image's
    # AWS SDK can resolve the region another way (instance profile,
    # IRSA, etc.).
    region: ""
    # Secret holding AWS credentials. The IAM principal needs
    # s3:PutObject + s3:ListBucket on the target bucket only.
    credentialsSecret:
      name: certctl-backup-aws-creds
      accessKeyIdKey: AWS_ACCESS_KEY_ID
      secretAccessKeyKey: AWS_SECRET_ACCESS_KEY
  # Job housekeeping.
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  startingDeadlineSeconds: 300
  backoffLimit: 1
  activeDeadlineSeconds: 3600
  # Resource budget for the backup container. pg_dump is generally
  # memory-light; ~250MB RSS for fleets up to 100K certs is typical.
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi
  # Optional tolerations for the backup Job pod.
  tolerations: []
 # ==============================================================================
 # Migrations via Helm hook (Phase 4 DEPL-M1 closure, 2026-05-14)
 # ==============================================================================
 # When viaHook: true, the chart deploys templates/migration-job.yaml as
 # a pre-install + pre-upgrade hook that runs `certctl-server
 # --migrate-only` (a hermetic schema-mutation pass) before the server
 # Deployment rolls.
 #
 # Set CERTCTL_MIGRATIONS_VIA_HOOK=true in the server Deployment env to
 # tell the server to skip its boot-time RunMigrations call (the hook
 # already did the work; running again at boot would race across
 # replicas during rollouts).
 #
 # Default OFF — when off, the server runs migrations at boot exactly
 # as it always has (Compose deploys keep this path).
 migrations:
  viaHook: false
  # Job housekeeping.
  backoffLimit: 1
  activeDeadlineSeconds: 600
  # Resource budget for the migration Job pod. The migration pass is
  # I/O-bound on Postgres; matches the server's resource budget by
  # default. Override here if migrations on a large database need
  # more headroom than the steady-state server.
  # resources:
  #   requests:
  #     cpu: 100m
  #     memory: 128Mi
  #   limits:
  #     cpu: 500m
  #     memory: 512Mi
 # ==============================================================================
 # Network Policy (Bundle 3 closure / D11)
 # ==============================================================================
@@ -36,6 +36,14 @@ server:
  service:
    type: ClusterIP
    # DEPL-006 closure (Sprint 3, 2026-05-16): with replicas:3, the
    # default round-robin Service load balancing breaks login/CSRF
    # flows because the session cookie + the CSRF token row land on
    # different pods between requests. sessionAffinity: ClientIP
    # routes every connection from a given source IP to the same
    # pod for the configured timeout window. docs/operator/runbooks/ha.md
    # documents this; pre-fix the chart did not actually render it.
    sessionAffinity: ClientIP
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "8443"
@@ -53,6 +61,14 @@ server:
  rateLimiting:
    rps: 500
    burst: 1000
    # DEPL-006 closure (Sprint 3, 2026-05-16): replicas > 1 REQUIRES
    # the postgres backend so per-key buckets are cross-replica-
    # consistent. The default 'memory' backend gives each pod its
    # own bucket map, so a 3-replica fleet effectively triples the
    # configured cap (a client churning across pods bypasses the
    # limit). See deploy/helm/certctl/values.yaml L217-226 for the
    # canonical comment.
    backend: postgres
 postgresql:
  enabled: true
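The `rateLimiting` comment in the hunk above argues that per-pod in-memory buckets multiply the effective cap by the replica count. A minimal Python sketch of that arithmetic follows. It assumes a burst-only bucket with no refill and a client whose requests land round-robin across pods; `TokenBucket` and `admitted` are hypothetical names for illustration, not certctl code.

```python
class TokenBucket:
    """Burst-only token bucket; refill is omitted to keep the arithmetic visible."""

    def __init__(self, burst):
        self.tokens = burst

    def allow(self):
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False


def admitted(request_count, buckets):
    """Spread requests round-robin across pods and count how many are admitted."""
    return sum(
        1 for i in range(request_count) if buckets[i % len(buckets)].allow()
    )


BURST = 1000   # mirrors rateLimiting.burst from the values hunk
REPLICAS = 3   # mirrors the replicas:3 posture the comment describes

# 'memory' backend: every pod owns a private bucket for the same client key.
per_pod = [TokenBucket(BURST) for _ in range(REPLICAS)]

# Shared backend: all pods consult one bucket (the role postgres plays).
shared_bucket = TokenBucket(BURST)
shared = [shared_bucket] * REPLICAS

per_pod_admitted = admitted(5000, per_pod)
shared_admitted = admitted(5000, shared)
print(per_pod_admitted, shared_admitted)  # prints "3000 1000"
```

With private buckets the same client gets three times the configured burst; with one shared bucket the cap holds regardless of which pod serves the request.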
@@ -0,0 +1,225 @@
 #!/usr/bin/env bash
 # Copyright 2026 certctl LLC. All rights reserved.
 # SPDX-License-Identifier: BUSL-1.1
 #
 # Acquisition-audit DEPL-005 + DATA-012 closure (Sprint 4 ACQ,
 # 2026-05-16). Backup/restore smoke harness — orchestrates a real
 # pg_dump -Fc → DROP SCHEMA → CREATE SCHEMA → pg_restore loop
 # around the audit_events hash chain and asserts the chain head
 # round-trips byte-for-byte.
 #
 # This script is the body of the `.github/workflows/backup-restore.yml`
 # weekly job AND the same thing an operator can run locally against a
 # running Postgres to gain confidence before a real restore.
 #
 # Prereqs
 # =======
 # - psql / pg_dump / pg_restore installed and on PATH (ubuntu-latest
 #   ships postgresql-client by default; on macOS use Homebrew's
 #   libpq).
 # - A reachable Postgres at $PGHOST:$PGPORT, plus the certctl user +
 #   database created. In CI we point this at the GHA service container
 #   (postgres:16-alpine, pinned to the same digest as
 #   deploy/docker-compose.yml). Locally, point it wherever, but the
 #   script drops the `public` schema (CASCADE) of the database it
 #   connects to, so DO NOT POINT THIS AT A DATABASE YOU CARE ABOUT.
 # - Go 1.25+ on PATH so the smoke program can be built. (CI's
 #   setup-go step handles this.)
 # - jq is NOT required — JSON snapshots are compared via python3.
 #
 # Behavior contract
 # =================
 # - On success: exit 0, prints "PASS" + a summary line.
 # - On any assertion failure: prints `::error::<reason>`, exits 1.
 #   (The ::error:: prefix is the GitHub Actions log-annotation shape;
 #    it surfaces as a red banner in the Actions run UI.)
 #
 # Non-goals
 # =========
 # - Does not exercise PITR / WAL archiving. The Sprint 4 scope is the
 #   pg_dump/pg_restore path only; managed-DB PITR is the operator's
 #   responsibility per docs/operator/runbooks/postgres-backup.md.
 # - Does not regenerate the audit chain after restore. A "restore
 #   that rewrote history" would mask exactly the bug under test.
 set -euo pipefail
 REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)"
 WORKDIR="$(mktemp -d)"
 trap 'rm -rf "$WORKDIR"' EXIT
 # ----------------------------------------------------------------------
 # Configuration — every knob is env-overridable so the same script
 # runs unchanged in CI (where the GHA service container exposes
 # 127.0.0.1:5432) and on an operator's laptop (where they may have
 # Postgres on a UNIX socket or a different port).
 # ----------------------------------------------------------------------
 : "${PGHOST:=127.0.0.1}"
 : "${PGPORT:=5432}"
 : "${PGUSER:=certctl}"
 : "${PGPASSWORD:=certctl}"
 : "${PGDATABASE:=certctl}"
 : "${SMOKE_ROWS:=24}"
 : "${MIGRATIONS_PATH:=${REPO_ROOT}/migrations}"
 # psql/pg_dump/pg_restore all read PG* env vars. Export so we don't
 # have to spell them out on every command line.
 export PGHOST PGPORT PGUSER PGPASSWORD PGDATABASE
 DB_URL="postgres://${PGUSER}:${PGPASSWORD}@${PGHOST}:${PGPORT}/${PGDATABASE}?sslmode=disable"
 fail() {
 	# GitHub Actions log annotation. The `::error::` prefix is what
 	# the Actions UI uses to highlight a line in the run log.
 	echo "::error::backup-restore-smoke: $*" >&2
 	exit 1
 }
 step() { printf '\n=== %s ===\n' "$*"; }
 # ----------------------------------------------------------------------
 # Sanity preflight
 # ----------------------------------------------------------------------
 step "preflight"
 command -v psql       >/dev/null || fail "psql not on PATH (install postgresql-client)"
 command -v pg_dump    >/dev/null || fail "pg_dump not on PATH"
 command -v pg_restore >/dev/null || fail "pg_restore not on PATH"
 command -v go         >/dev/null || fail "go not on PATH (need Go to build the smoke program)"
 command -v python3    >/dev/null || fail "python3 not on PATH (used for JSON diff)"
 test -d "${MIGRATIONS_PATH}" || fail "migrations dir not found: ${MIGRATIONS_PATH}"
 # Wait for Postgres readiness up to 60s. pg_isready returns 0 when
 # the server is accepting connections, so the loop is the canonical
 # CI-friendly "wait for the service container" pattern.
 step "waiting for postgres at ${PGHOST}:${PGPORT}"
 for _ in $(seq 1 60); do
 	if pg_isready -h "${PGHOST}" -p "${PGPORT}" -U "${PGUSER}" -d "${PGDATABASE}" -q; then
 		break
 	fi
 	sleep 1
 done
 pg_isready -h "${PGHOST}" -p "${PGPORT}" -U "${PGUSER}" -d "${PGDATABASE}" -q \
 	|| fail "postgres not ready after 60s at ${PGHOST}:${PGPORT}"
 # Wipe any prior state in the target DB. A previous failed run could
 # have left rows behind; the smoke contract is "starts from clean."
 step "wiping ${PGDATABASE} schema (DROP SCHEMA public CASCADE; CREATE SCHEMA public)"
 psql -v ON_ERROR_STOP=1 -c 'DROP SCHEMA IF EXISTS public CASCADE; CREATE SCHEMA public; GRANT ALL ON SCHEMA public TO PUBLIC;'
 # ----------------------------------------------------------------------
 # Build the smoke program. We `go build` into ${WORKDIR} (removed by
 # the EXIT trap) so no binary is left behind, and the workload and
 # verify phases reuse the same compiled binary.
 # ----------------------------------------------------------------------
 step "building smoke program"
 cd "${REPO_ROOT}"
 go build -o "${WORKDIR}/smoke" ./deploy/test/backupsmoke
 # ----------------------------------------------------------------------
 # Phase 1 — workload: migrate, insert rows, snapshot chain head.
 # ----------------------------------------------------------------------
 step "phase 1 — workload (${SMOKE_ROWS} audit_events rows)"
 "${WORKDIR}/smoke" \
 	--mode=workload \
 	--db-url="${DB_URL}" \
 	--migrations-path="${MIGRATIONS_PATH}" \
 	--rows="${SMOKE_ROWS}" \
 	| tee "${WORKDIR}/pre.json"
 # ----------------------------------------------------------------------
 # Phase 2 — backup. Canonical pg_dump shape per
 # deploy/helm/certctl/templates/backup-cronjob.yaml: --format=custom,
 # --no-owner, --no-acl. --no-owner / --no-acl keep the dump portable
 # across Postgres installations with different role layouts (the
 # audit-trail hash chain is data, not ACL state).
 # ----------------------------------------------------------------------
 step "phase 2 — pg_dump -Fc"
 pg_dump --format=custom --no-owner --no-acl --dbname="${PGDATABASE}" --file="${WORKDIR}/backup.dump"
 test -s "${WORKDIR}/backup.dump" || fail "pg_dump produced an empty file"
 # ----------------------------------------------------------------------
 # Phase 3 — wipe. The fresh-schema approach is the closest analogue
 # to "operator nuked the wrong volume." DROP DATABASE would require
 # connecting to a different DB plus a reconnect dance; DROP SCHEMA
 # achieves the same "no rows, no schema, no functions" end state
 # inside the existing connection and is restore-compatible (pg_dump
 # -Fc bundles the schema in the dump, so pg_restore recreates it).
 # ----------------------------------------------------------------------
 step "phase 3 — drop schema (simulating data-loss event)"
 psql -v ON_ERROR_STOP=1 -c 'DROP SCHEMA IF EXISTS public CASCADE; CREATE SCHEMA public; GRANT ALL ON SCHEMA public TO PUBLIC;'
 # Sanity: confirm audit_events is actually gone before restore. A
 # regression here (e.g. DROP SCHEMA silently becoming a no-op) would
 # let the verifier "succeed" by reading the original rows, producing
 # a false pass.
 PRE_RESTORE_TABLES=$(psql -tAc "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public'")
 if [ "${PRE_RESTORE_TABLES}" -ne 0 ]; then
 	fail "post-DROP SCHEMA, expected 0 public tables; saw ${PRE_RESTORE_TABLES}"
 fi
 # ----------------------------------------------------------------------
 # Phase 4 — restore.
 # ----------------------------------------------------------------------
 step "phase 4 — pg_restore"
 pg_restore --dbname="${PGDATABASE}" --no-owner --no-acl --exit-on-error "${WORKDIR}/backup.dump"
 # ----------------------------------------------------------------------
 # Phase 5 — verify: re-snapshot, run audit_events_verify_chain().
 # ----------------------------------------------------------------------
 step "phase 5 — verify (audit_events_verify_chain() + snapshot)"
 "${WORKDIR}/smoke" \
 	--mode=verify \
 	--db-url="${DB_URL}" \
 	| tee "${WORKDIR}/post.json"
 # ----------------------------------------------------------------------
 # Phase 6 — assert.
 #
 #   pre.row_count       == post.row_count
 #   pre.chain_head_hash == post.chain_head_hash   (BYTE-EXACT)
 #   post.first_break_id == ""                     (verifier clean)
 #   post.verifier_walked == pre.row_count         (every row walked)
 #
 # Use python3 rather than jq so the script runs unchanged on macOS
 # without an extra Homebrew install.
 # ----------------------------------------------------------------------
 step "phase 6 — assertions"
 python3 - <<'PY' "${WORKDIR}/pre.json" "${WORKDIR}/post.json"
 import json, sys
 pre  = json.load(open(sys.argv[1]))
 post = json.load(open(sys.argv[2]))
 def bail(msg):
    print(f"::error::backup-restore-smoke: {msg}", file=sys.stderr)
    sys.exit(1)
 if pre["row_count"] != post["row_count"]:
    bail(f"row_count mismatch: pre={pre['row_count']} post={post['row_count']}")
 if pre["chain_head_hash"] != post["chain_head_hash"]:
    bail(
        "chain_head_hash mismatch — pg_dump/pg_restore did NOT round-trip the "
        "audit_events hash chain byte-for-byte. "
        f"pre={pre['chain_head_hash']} post={post['chain_head_hash']}"
    )
 if post.get("first_break_id", "") != "":
    bail(
        "audit_events_verify_chain() reports a break post-restore at id="
        f"{post['first_break_id']} pos={post.get('first_break_pos', '?')} — "
        "the chain is no longer self-consistent after the restore."
    )
 if post.get("verifier_walked", -1) != pre["row_count"]:
    bail(
        f"verifier_walked={post.get('verifier_walked')} != pre.row_count="
        f"{pre['row_count']} — verifier short-circuited or read stale rows."
    )
 print(
    f"PASS  rows={pre['row_count']}  "
    f"chain_head={pre['chain_head_hash'][:16]}…  "
    f"verifier=clean"
 )
 PY
@@ -0,0 +1,222 @@
 // Copyright 2026 certctl LLC. All rights reserved.
 // SPDX-License-Identifier: BUSL-1.1
 // Command backupsmoke is the workload+verifier half of the
 // backup/restore CI gate (acquisition-audit DEPL-005 + DATA-012
 // closure, Sprint 4 ACQ, 2026-05-16).
 //
 // The companion shell harness `deploy/test/backup-restore-smoke.sh`
 // orchestrates the dump/drop/restore lifecycle around two
 // invocations of this program: one before the backup
 // (--mode=workload) and one after the restore (--mode=verify). Both
 // emit a small JSON snapshot to stdout; the shell harness diffs them
 // and asserts the chain head + row count round-trip byte-for-byte.
 //
 // Modes
 // =====
 //
 //	--mode=workload
 //	  Run all up-migrations against `--migrations-path`, then
 //	  generate `--rows` (default 24) audit_events rows representing
 //	  an issue / renew / revoke / auth-login cycle. Emit a snapshot
 //	  with the post-workload row_count + chain head row_hash.
 //
 //	--mode=verify
 //	  Run `audit_events_verify_chain()` (the per-row hash-chain
 //	  verifier installed by migration 000047) and capture
 //	  first_break_id / first_break_pos / verifier_walked. Emit a
 //	  snapshot with row_count + chain head row_hash + verifier
 //	  output. No mutations.
 //
 // The CI assertion contract
 // =========================
 //
 // After (workload → pg_dump -Fc → DROP + CREATE → pg_restore →
 // verify), the shell asserts:
 //
 //	pre.row_count        == post.row_count
 //	pre.chain_head_hash  == post.chain_head_hash   (byte-exact)
 //	post.first_break_id  == ""                     (verifier clean)
 //	post.verifier_walked == pre.row_count          (every row walked)
 //
 // A pg_dump format-quirk that didn't preserve TIMESTAMPTZ
 // microseconds would surface as a chain-head mismatch (the
 // canonical payload re-formats `timestamp AT TIME ZONE 'UTC'` to
 // microsecond ISO-8601 — any precision loss breaks the hash). A
 // trigger-or-function regression would surface as a non-empty
 // first_break_id from the verifier. The test exists to PROVE these properties
 // under a real workload, not to defend against a known quirk.
 package main
 import (
 	"context"
 	"database/sql"
 	"encoding/json"
 	"flag"
 	"fmt"
 	"log"
 	"os"
 	"time"
 	_ "github.com/lib/pq"
 	"github.com/certctl-io/certctl/internal/repository/postgres"
 )
 // Snapshot is the on-the-wire shape emitted to stdout. The shell
 // orchestrator parses it with a python3 heredoc (json.load) and diffs
 // the relevant fields. Keep it stable — any rename here must land
 // alongside a shell-harness change.
 type Snapshot struct {
 	Phase          string `json:"phase"`
 	RowCount       int    `json:"row_count"`
 	ChainHead      string `json:"chain_head_hash"`
 	FirstBreakID   string `json:"first_break_id,omitempty"`
 	FirstBreakPos  int    `json:"first_break_pos,omitempty"`
 	VerifierWalked int    `json:"verifier_walked,omitempty"`
 }
 func main() {
 	var (
 		mode           = flag.String("mode", "", "workload | verify")
 		dbURL          = flag.String("db-url", os.Getenv("DATABASE_URL"), "Postgres URL (or set DATABASE_URL)")
 		migrationsPath = flag.String("migrations-path", "./migrations", "Path to the migrations/ directory (workload mode only)")
 		rows           = flag.Int("rows", 24, "Number of audit_events rows to insert (workload mode only)")
 	)
 	flag.Parse()
 	if *dbURL == "" {
 		log.Fatal("--db-url or DATABASE_URL is required")
 	}
 	if *mode == "" {
 		log.Fatal("--mode is required (workload | verify)")
 	}
 	db, err := sql.Open("postgres", *dbURL)
 	if err != nil {
 		log.Fatalf("sql.Open: %v", err)
 	}
 	defer db.Close()
 	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
 	defer cancel()
 	if err := db.PingContext(ctx); err != nil {
 		log.Fatalf("ping: %v", err)
 	}
 	switch *mode {
 	case "workload":
 		// Run all up-migrations end-to-end. The trigger + verifier
 		// function installed by migration 000047 must be in place
 		// before the inserts below; partial migration would mask a
 		// real bug.
 		if err := postgres.RunMigrations(db, *migrationsPath); err != nil {
 			log.Fatalf("RunMigrations(%s): %v", *migrationsPath, err)
 		}
 		if err := runWorkload(ctx, db, *rows); err != nil {
 			log.Fatalf("runWorkload: %v", err)
 		}
 		snap, err := snapshot(ctx, db, "workload", false)
 		if err != nil {
 			log.Fatalf("snapshot: %v", err)
 		}
 		emit(snap)
 	case "verify":
 		snap, err := snapshot(ctx, db, "verify", true)
 		if err != nil {
 			log.Fatalf("snapshot: %v", err)
 		}
 		emit(snap)
 	default:
 		log.Fatalf("unknown --mode=%q (workload | verify)", *mode)
 	}
 }
 // runWorkload inserts n audit_events rows representing an
 // issue / renew / revoke / auth-login cycle. Patterns mirror the
 // shape the application emits (see internal/service/audit_*.go),
 // so the canonical payload exercised here is representative.
 //
 // event_category is omitted on each INSERT — migration 000032 gave
 // the column DEFAULT 'cert_lifecycle', which is also the value the
 // application uses for cert lifecycle events. Auth rows get the
 // default too, which is harmless for the round-trip property under
 // test (only the canonical-payload byte sequence matters).
 //
 // Timestamps are monotonic via the `NOW() + ($interval ||
 // ' microsecond')::interval` pattern from
 // internal/repository/postgres/audit_chain_test.go — ordering
 // determinism is necessary for the chain head to be stable across
 // runs.
 func runWorkload(ctx context.Context, db *sql.DB, n int) error {
 	actions := []struct{ act, resType, resID string }{
 		{"certificate.issue", "certificate", "mc-smoke"},
 		{"certificate.renew", "certificate", "mc-smoke"},
 		{"certificate.revoke", "certificate", "mc-smoke"},
 		{"auth.login", "session", "sess-smoke"},
 	}
 	for i := 0; i < n; i++ {
 		a := actions[i%len(actions)]
 		id := fmt.Sprintf("audit-smoke-%04d", i)
 		_, err := db.ExecContext(ctx, `
 			INSERT INTO audit_events (
 				id, actor, actor_type, action,
 				resource_type, resource_id, details, timestamp
 			)
 			VALUES (
 				$1, 'smoke-actor', 'User', $2,
 				$3, $4, '{}'::jsonb,
 				NOW() + ($5 || ' microsecond')::interval
 			)
 		`, id, a.act, a.resType, a.resID, fmt.Sprintf("%d", i))
 		if err != nil {
 			return fmt.Errorf("insert row %d (%s): %w", i, id, err)
 		}
 	}
 	return nil
 }
 // snapshot reads the chain head + row count, optionally invoking
 // the on-demand verifier. Verifier output goes in three additional
 // fields so the workload-side snapshot can omit them via the
 // `omitempty` tag.
 func snapshot(ctx context.Context, db *sql.DB, phase string, runVerifier bool) (*Snapshot, error) {
 	s := &Snapshot{Phase: phase}
 	if err := db.QueryRowContext(ctx, `SELECT COUNT(*) FROM audit_events`).Scan(&s.RowCount); err != nil {
 		return nil, fmt.Errorf("count(audit_events): %w", err)
 	}
 	if err := db.QueryRowContext(ctx, `SELECT row_hash FROM audit_chain_head WHERE id = 1`).Scan(&s.ChainHead); err != nil {
 		return nil, fmt.Errorf("read audit_chain_head: %w", err)
 	}
 	if runVerifier {
 		var brokenID sql.NullString
 		var brokenPos, walked int
 		err := db.QueryRowContext(ctx, `
 			SELECT first_break_id, first_break_pos, row_count
 			FROM audit_events_verify_chain()
 		`).Scan(&brokenID, &brokenPos, &walked)
 		if err != nil {
 			return nil, fmt.Errorf("audit_events_verify_chain(): %w", err)
 		}
 		if brokenID.Valid {
 			s.FirstBreakID = brokenID.String
 		}
 		s.FirstBreakPos = brokenPos
 		s.VerifierWalked = walked
 	}
 	return s, nil
 }
 // emit pretty-prints the snapshot to stdout. The trailing newline
 // from json.Encoder is the right shape for both shell `tee` and
 // python3's json.load on the tee'd file.
 func emit(s *Snapshot) {
 	enc := json.NewEncoder(os.Stdout)
 	enc.SetIndent("", "  ")
 	if err := enc.Encode(s); err != nil {
 		log.Fatalf("encode snapshot: %v", err)
 	}
 }
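The header comment of `backupsmoke` claims that any loss of TIMESTAMPTZ microsecond precision would surface as a chain-head mismatch and as a verifier break. A small Python sketch of that reasoning follows. The payload layout, the `GENESIS` value, and the helper names are assumptions made for illustration; the real canonical payload is defined by migration 000047 and is not shown in this diff.

```python
import hashlib
from datetime import datetime, timezone

GENESIS = "0" * 64  # assumed starting value for the chain


def row_hash(prev_hash, row):
    """Hash the previous link plus a canonical payload with a microsecond timestamp."""
    stamp = row["timestamp"].astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")
    payload = "|".join([prev_hash, row["id"], row["action"], stamp])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_hashes(rows):
    """Return the stored row_hash for every row, in chain order."""
    hashes, prev = [], GENESIS
    for row in rows:
        prev = row_hash(prev, row)
        hashes.append(prev)
    return hashes


def first_break(rows, stored_hashes):
    """Return the id of the first row whose recomputed hash differs, or ''."""
    prev = GENESIS
    for row, stored in zip(rows, stored_hashes):
        if row_hash(prev, row) != stored:
            return row["id"]
        prev = stored
    return ""


rows = [
    {
        "id": f"audit-smoke-{i:04d}",
        "action": "certificate.issue",
        "timestamp": datetime(2026, 5, 16, 12, 0, 0, 123456 + i, tzinfo=timezone.utc),
    }
    for i in range(4)
]
stored = build_hashes(rows)

# Simulate a restore that truncated one timestamp to millisecond precision.
lossy = [dict(r) for r in rows]
lossy[2]["timestamp"] = lossy[2]["timestamp"].replace(microsecond=123000)

print(first_break(rows, stored) == "")   # intact data verifies clean
print(first_break(lossy, stored))        # id of the row that lost precision
```

Because every hash feeds the next, recomputing the chain over the lossy rows also yields a different head, which is the property the harness's byte-exact `chain_head_hash` comparison relies on.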
@@ -82,7 +82,17 @@ ARG LIBEST_REF
 # is the same major version libest r3.2.0 was tested against. libest
 # also wants libcurl + libsafec; we install both via apt rather than
 # building from source for reproducibility.
-RUN apt-get update && apt-get install --no-install-recommends -y \
+#
 # Hotfix #18 (2026-05-14): wrap in a 3-retry loop with --fix-missing
 # fallback to absorb transient Debian mirror flakes. The original
 # unwrapped apt-get install failed CI run #N on a "Connection reset
 # by peer" mid-fetch of libssh2-1 from fastly's debian.org mirror at
 # 151.101.202.132. Mirrors flake; production-grade Dockerfiles wrap
 # network ops in retry. Same pattern as the main Dockerfile's npm-ci
 # 3-retry loop from Hotfix #9.
 RUN for i in 1 2 3; do \
        apt-get update && \
        apt-get install --no-install-recommends -y --fix-missing \
            autoconf \
            automake \
            build-essential \
@@ -92,6 +102,10 @@ RUN apt-get update && apt-get install --no-install-recommends -y \
            libssl-dev \
            libtool \
            pkg-config \
        && break; \
        if [ "$i" -eq 3 ]; then echo "apt-get install failed after 3 attempts" >&2; exit 1; fi; \
        echo "apt-get install attempt $i/3 failed; sleeping 5s before retry"; \
        sleep 5; \
    done \
    && rm -rf /var/lib/apt/lists/*
 WORKDIR /src
@@ -172,13 +186,22 @@ RUN git clone --depth 1 --branch ${LIBEST_REF} https://github.com/cisco/libest.g
 # Pinned to the same digest as the builder above (Bundle A / H-001).
 FROM debian:bullseye-slim@sha256:1a4701c321b1d28b1ff5f0230e766791e4b79b1d4c6c7a70064f4b297b1a330f
-RUN apt-get update && apt-get install --no-install-recommends -y \
+# Hotfix #18 (2026-05-14): same 3-retry pattern as the builder stage
 # above. Runtime image installs are also vulnerable to transient
 # mirror flakes.
 RUN for i in 1 2 3; do \
        apt-get update && \
        apt-get install --no-install-recommends -y --fix-missing \
            bash \
            ca-certificates \
            curl \
            libcurl4 \
            libssl1.1 \
            openssl \
        && break; \
        if [ "$i" -eq 3 ]; then echo "apt-get install failed after 3 attempts" >&2; exit 1; fi; \
        echo "apt-get install attempt $i/3 failed; sleeping 5s before retry"; \
        sleep 5; \
    done \
    && rm -rf /var/lib/apt/lists/* \
    && useradd --create-home --uid 1000 estuser
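The Hotfix #18 comments describe wrapping a flaky network operation in a bounded retry. The sketch below restates that pattern in Python, with one property made explicit that any such loop should have: when every attempt fails, the failure propagates instead of being swallowed by the final sleep. `retry` and its parameters are hypothetical names; the attempt count and delay mirror the values in the Dockerfile.

```python
import time


def retry(operation, attempts=3, delay_s=5.0, sleep=time.sleep):
    """Run operation up to `attempts` times; re-raise if every attempt fails."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as err:  # broad on purpose: any failure triggers a retry
            last_error = err
            if attempt < attempts:
                print(f"attempt {attempt}/{attempts} failed; sleeping {delay_s}s before retry")
                sleep(delay_s)
    # Reaching this point means no attempt succeeded, so surface the failure.
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


# Usage: an operation that fails twice with a mirror-style error, then succeeds.
calls = {"n": 0}


def flaky_install():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionResetError("connection reset by peer")
    return "installed"


slept = []
result = retry(flaky_install, sleep=slept.append)  # inject sleep to keep the demo fast
print(result, slept)
```

Injecting `sleep` keeps the example testable without real delays; a shell loop gets the same effect by checking the attempt counter and exiting non-zero after the last failure.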
@@ -0,0 +1,52 @@
 # loadtest-artifacts/
 > Last reviewed: 2026-05-16
 Long-term archive of k6 load-test results from the `loadtest` GitHub
 Actions workflow. TEST-005 closure (Sprint 5, 2026-05-16) introduces
 this directory as the committed home for captures the operator
 chooses to retain past GitHub's 90-day artifact-retention window.
 ## What lands here
 After a `loadtest` workflow_dispatch run, follow the procedure in
 [`docs/operator/scale-baseline-2026-Q2.md`](../../../docs/operator/scale-baseline-2026-Q2.md#capture-procedure):
 1. Download the three matrix-leg artifacts from the workflow page.
 2. Update the latest-capture table in the baseline doc with the
   extracted percentiles.
 3. Commit the raw artifacts you want long-term-retained here, named:
   ```
   2026-Q2-bulk-renewal-<run-id>.tar.gz
   2026-Q2-acme-burst-<run-id>.tar.gz
   2026-Q2-agent-storm-<run-id>.tar.gz
   ```
 4. If any single archive exceeds 100 MB, route it through `git lfs`
   (configured at repo root via `.gitattributes`).
 ## Why commit artifacts rather than rely on GHA retention
 - **GitHub Actions retains workflow artifacts for 90 days by default.**
  Acquisition-diligence reviewers looking at scale evidence months
  later get a 404 unless we keep the raw NDJSON in tree.
 - **Reproducibility.** Pinning the k6 NDJSON to a SHA makes it
  cheap to re-derive percentiles with a different filter (e.g.
  "p99 excluding the warmup ramp's first 30 seconds") without
  re-running the workflow.
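The reproducibility point above (re-deriving a percentile with a different filter) can be sketched as follows. This assumes the retained NDJSON follows k6's `--out json` line format (`type`, `metric`, `data.time`, `data.value`) and uses a nearest-rank percentile, which can differ slightly from the figures k6 prints in its own summary. The function names and the `http_req_duration` default are illustrative choices, not part of the workflow.

```python
import json
import math
import re
from datetime import datetime, timedelta, timezone

_FRACTION = re.compile(r"\.(\d+)")


def parse_time(stamp):
    """Parse an RFC 3339 timestamp, truncating the fraction to microseconds."""
    stamp = stamp.replace("Z", "+00:00")
    stamp = _FRACTION.sub(lambda m: "." + m.group(1)[:6].ljust(6, "0"), stamp, count=1)
    return datetime.fromisoformat(stamp)


def percentile_after_warmup(lines, metric="http_req_duration", warmup_s=30.0, pct=99.0):
    """Nearest-rank percentile of `metric`, ignoring the first `warmup_s` seconds."""
    samples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        if rec.get("type") != "Point" or rec.get("metric") != metric:
            continue
        samples.append((parse_time(rec["data"]["time"]), rec["data"]["value"]))
    if not samples:
        raise ValueError(f"no samples for metric {metric!r}")
    cutoff = min(t for t, _ in samples) + timedelta(seconds=warmup_s)
    kept = sorted(v for t, v in samples if t >= cutoff)
    if not kept:
        raise ValueError("warmup window excludes every sample")
    rank = max(math.ceil(pct / 100.0 * len(kept)), 1)
    return kept[rank - 1]


# Synthetic input: one sample per second for 100 s, value equal to the offset.
start = datetime(2026, 5, 16, tzinfo=timezone.utc)
lines = [
    json.dumps({
        "type": "Point",
        "metric": "http_req_duration",
        "data": {"time": (start + timedelta(seconds=i)).isoformat(), "value": float(i)},
    })
    for i in range(100)
]
p99 = percentile_after_warmup(lines)
print(p99)  # prints 99.0
```

Against a real capture, pass the lines of the extracted NDJSON file instead of the synthetic list; the archive must be unpacked first.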
 ## What does NOT belong here
 - **Per-PR ephemeral runs.** The `loadtest` workflow runs on
  workflow_dispatch + weekly cron; per-PR runs would be too noisy
  and aren't retained.
 - **Production-environment captures.** These artifacts are the
  ubuntu-latest reference baseline. An operator capturing their
  production-environment scale should put the artifacts in their
  own observability platform — committing them here would imply
  "these are certctl's reference numbers", which they are not.
 - **Manual k6 captures from a developer's laptop.** Same rationale
  as the visual-regression snapshot runbook
  ([`docs/operator/runbooks/e2e-snapshot-update.md`](../../../docs/operator/runbooks/e2e-snapshot-update.md))
  — only the CI environment produces canonical numbers.
@@ -352,8 +352,35 @@ the ACME flow scenario. Operators with kind / cert-manager available
 should pair this with `make acme-cert-manager-test` for end-to-end
 verification.
 ## Scale tier (Phase 8 SCALE-H2, 2026-05-14)
 Phase 8 closure added three new k6 scenarios that exercise the
 scale-relevant load surfaces the API tier and connector tier left
 uncovered:
 | Scenario | k6 file | Seed | Make target |
 |---|---|---|---|
 | Bulk-renewal under load | `k6/bulk_renewal.js` | `seed/01_bulk_renewal_certs.sql` (10K certs) | `make loadtest-scale-bulk` |
 | ACME enrollment burst | `k6/acme_burst.js` | (none — unauth surface) | `make loadtest-scale-acme` |
 | Agent heartbeat storm | `k6/agent_storm.js` | `seed/02_agent_fleet.sql` (5K agents) | `make loadtest-scale-agent` |
 The scale-tier scenarios live behind the `scale` compose profile so
 the default `make loadtest` (API tier + connector tier, ~7 min)
 stays fast. Run all three serially with `make loadtest-scale`, or
 trigger the `loadtest.yml` workflow's `k6-scale` matrix jobs from
 the Actions tab for canonical-hardware capture.
 Operator-facing baseline table + threshold contracts + documented
 limitations live in [`docs/operator/scale.md`](../../../docs/operator/scale.md)
 under the "Scale-tier scenarios (SCALE-H2, Phase 8)" section. Treat
 that as the canonical source — this README only links.
 The seed fixtures + their idempotency contract are documented in
 [`seed/README.md`](seed/README.md).
 ## Audit references
 - API tier:       2026-05-01 issuer coverage audit fix #8.
 - Connector tier: 2026-05-02 deployment-target audit Bundle 10.
 - ACME flows:     Phase 5 master prompt (project notes).
 - Scale tier:     2026-05-14 architecture diligence Phase 8 (SCALE-H2).
@@ -351,3 +351,128 @@ services:
      - run
      - --summary-export=/results/summary.json
      - /scripts/k6.js
  # ===========================================================================
  # Phase 8 SCALE-H2 — scale-tier scenarios (opt-in via `--profile scale`).
  #
  # The default `make loadtest` path runs the API tier + connector tier
  # scenarios above against the demo-scale seed. The Phase 8 scenarios are
  # heavier (10K cert + 5K agent fixtures) and would slow the default path
  # without serving the per-PR signal the existing run targets, so they live
  # behind a separate compose profile.
  #
  # Three components, all profile-gated:
  #   1. scale-seed    — one-shot init that runs ./seed/*.sql against the
  #                      same postgres the server uses. Idempotent.
  #   2. k6-scale-bulk / k6-scale-acme / k6-scale-agent — one driver each
  #                      for the three Phase 8 scenarios. The matrix dispatch
  #                      in .github/workflows/loadtest.yml picks one per job.
  #
  # Run a single scale scenario locally:
  #   docker compose --profile scale up \
  #       --abort-on-container-exit --exit-code-from k6-scale-bulk \
  #       scale-seed k6-scale-bulk
  # ===========================================================================
  scale-seed:
    # postgres:16-alpine bundles psql; no extra image needed.
    image: postgres:16-alpine
    container_name: certctl-loadtest-scale-seed
    restart: "no"
    profiles: ["scale"]
    depends_on:
      postgres:
        condition: service_healthy
      # Wait for certctl-server to be healthy — the server runs schema
      # migrations + seed_demo.sql at boot. The Phase 8 seeds reference
      # FKs (iss-local, o-alice, t-platform, rp-standard) that
      # seed_demo.sql creates, so the order MUST be:
      #   postgres up → server runs migrations + seed_demo.sql → scale-seed runs
      certctl-server:
        condition: service_healthy
    environment:
      PGHOST: postgres
      PGUSER: certctl
      PGPASSWORD: loadtestpass
      PGDATABASE: certctl
    volumes:
      - ./seed:/seed:ro
    entrypoint: /bin/sh
    command:
      - -c
      - |
        set -eu
        echo "==> Phase 8 scale-seed: running SQL fixtures (lexical order)"
        for f in /seed/*.sql; do
            echo "----> $$f"
            psql -v ON_ERROR_STOP=1 -f "$$f"
        done
        echo "==> Phase 8 scale-seed: complete"
  k6-scale-bulk:
    image: grafana/k6:0.54.0
    container_name: certctl-loadtest-k6-bulk
    profiles: ["scale"]
    depends_on:
      certctl-server:
        condition: service_healthy
      scale-seed:
        condition: service_completed_successfully
    environment:
      CERTCTL_BASE: https://certctl-server:8443
      CERTCTL_TOKEN: load-test-token
      K6_INSECURE_SKIP_TLS_VERIFY: "true"
    volumes:
      - ./k6/bulk_renewal.js:/scripts/bulk_renewal.js:ro
      - ./results:/results
    command:
      - run
      - --summary-export=/results/summary-bulk-renewal.json
      - /scripts/bulk_renewal.js
  k6-scale-acme:
    image: grafana/k6:0.54.0
    container_name: certctl-loadtest-k6-acme
    profiles: ["scale"]
    depends_on:
      certctl-server:
        condition: service_healthy
      # ACME scenario doesn't depend on the SQL seeds (it hits the
      # unauthenticated directory + nonce + ARI surface) but routing
      # it through the same dependency chain keeps the compose
      # ordering predictable across the three scale jobs.
      scale-seed:
        condition: service_completed_successfully
    environment:
      CERTCTL_ACME_DIRECTORY: https://certctl-server:8443/acme/profile/prof-test/directory
      K6_INSECURE_SKIP_TLS_VERIFY: "true"
    volumes:
      - ./k6/acme_burst.js:/scripts/acme_burst.js:ro
      - ./results:/results
    command:
      - run
      - --summary-export=/results/summary-acme-burst.json
      - /scripts/acme_burst.js
  k6-scale-agent:
    image: grafana/k6:0.54.0
    container_name: certctl-loadtest-k6-agent
    profiles: ["scale"]
    depends_on:
      certctl-server:
        condition: service_healthy
      scale-seed:
        condition: service_completed_successfully
    environment:
      CERTCTL_BASE: https://certctl-server:8443
      CERTCTL_TOKEN: load-test-token
      K6_INSECURE_SKIP_TLS_VERIFY: "true"
      # Match the seed's 5K-agent fleet.
      K6_AGENT_FLEET: "5000"
    volumes:
      - ./k6/agent_storm.js:/scripts/agent_storm.js:ro
      - ./results:/results
    command:
      - run
      - --summary-export=/results/summary-agent-storm.json
      - /scripts/agent_storm.js
@@ -0,0 +1,183 @@
 // Phase 8 SCALE-H2 — ACME enrollment burst.
 //
 // What this measures:
 //   200 concurrent VUs hammering the unauthenticated ACME directory
 //   + new-nonce + ARI surface for 5 minutes. The goal is the
 //   throughput ceiling for the entry-point handlers and the
 //   per-account rate-limit response shape Phase 5 added (RFC 8555
 //   §6.7 + RFC 7807 + the certctl-specific
 //   ErrACMEConcurrentOrdersExceeded path).
 //
 // What this does NOT measure (and why):
 //   - JWS-signed POST flows (new-account, new-order, finalize).
 //     k6 doesn't ship JWS, and bundling a Go signing helper into
 //     the k6 container would obscure the server-side latency the
 //     scenario is trying to pin. The existing
 //     `deploy/test/loadtest/k6/acme_flow.js` Phase 5 scenario
 //     made the same explicit trade-off; this Phase 8 burst scenario
 //     reuses the constraint. End-to-end JWS-signed conformance is
 //     gated by `make acme-rfc-conformance-test` (which uses lego
 //     against the same compose stack).
 //   - The actual order/finalize hot path. The newOrder handler's
 //     constant-time SCAN against acme_orders + the per-account
 //     concurrent-orders gate ARE useful to load-test, but require
 //     valid JWS to reach. The directory + new-nonce surface this
 //     scenario hits is what every ACME client transits BEFORE the
 //     signed flow — measuring it pins the server's headroom for
 //     the rest of the flow.
 //   - Issuer-side enrollment latency (DigiCert ACME, Let's Encrypt
 //     against a real prod CA, etc.). Same "load-testing someone
 //     else's API" carve-out as the API tier.
 //
 // What this DOES measure:
 //   - GET /acme/profile/{id}/directory throughput. Sustained 200
 //     concurrent VUs at a low per-VU sleep produces ~600-1000 req/s
 //     against this endpoint, well above what any production ACME
 //     client would generate but the right shape for finding the
 //     ceiling.
 //   - HEAD /acme/profile/{id}/new-nonce throughput. Nonce
 //     allocation is a hot path that writes one row to acme_nonces.
 //   - GET /acme/profile/{id}/renewal-info/{cert-id} 4xx fast path.
 //     Synthetic cert-id → handler returns 4xx without a DB lookup
 //     (cert-id is malformed at the parse layer). Measures the
 //     handler-front overhead under load.
 //   - 429 rate-limit response shape. The Phase 5 ACME per-account
 //     rate limit fires at sustained spike rates; the scenario pins
 //     that the 429 body is RFC 7807 with the
 //     "urn:ietf:params:acme:error:rateLimited" type. A regression
 //     that returned a plain text 429 or a different problem type
 //     would break ACME clients hard.
 //
 // Threshold contract:
 //   - directory p95 < 500ms, new-nonce p95 < 300ms, renewal-info
 //     p95 < 800ms — same as the Phase 5 acme_flow.js baselines.
 //   - 429 responses are EXPECTED at sustained 200 VU rate (the
 //     server's RFC-compliant rate limiter SHOULD kick in). The
 //     http_req_failed threshold is meant to count only 5xx, so
 //     429s don't break it; the separate `acme_rate_limited_count`
 //     Counter tracks them so the operator can see how often the
 //     limiter fires.
 import http from 'k6/http';
 import { check } from 'k6';
 import { Counter, Trend } from 'k6/metrics';
 import { textSummary } from 'https://jslib.k6.io/k6-summary/0.0.2/index.js';
 const ACME_BASE = __ENV.CERTCTL_ACME_DIRECTORY ||
    'https://certctl-server:8443/acme/profile/prof-test/directory';
 // Custom metrics.
 const directoryDuration = new Trend('acme_directory_duration', true);
 const newNonceDuration  = new Trend('acme_new_nonce_duration', true);
 const renewalInfoDuration = new Trend('acme_renewal_info_duration', true);
 const rateLimitedCount  = new Counter('acme_rate_limited_count');
 const rateLimitShapeOK  = new Counter('acme_rate_limit_shape_ok');
 export const options = {
    scenarios: {
        acme_burst: {
            executor: 'constant-vus',
            vus: parseInt(__ENV.K6_ACME_VUS || '200', 10),
            duration: __ENV.K6_ACME_DURATION || '5m',
            gracefulStop: '30s',
            tags: { scenario: 'acme_burst' },
        },
    },
    thresholds: {
        'acme_directory_duration':    ['p(95)<500'],
        'acme_new_nonce_duration':    ['p(95)<300'],
        'acme_renewal_info_duration': ['p(95)<800'],
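        // 4xx (rate-limited or malformed-cert-id) is expected; 5xx is
        // not. No request carries a `server_error` tag, so a threshold
        // filtered on it would never see a sample. Instead, the
        // setResponseCallback call below marks 2xx-4xx as expected,
        // leaving only 5xx and transport errors in http_req_failed.
        'http_req_failed{scenario:acme_burst}': ['rate<0.001'],
    },
    insecureSkipTLSVerify: true,
    summaryTrendStats: ['avg', 'min', 'med', 'p(95)', 'p(99)', 'max'],
 };
 // Treat every status from 200 through 499 as an expected response.
 http.setResponseCallback(http.expectedStatuses({ min: 200, max: 499 }));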
 export default function () {
    // Step 1 — directory.
    let res = http.get(ACME_BASE, {
        tags: { scenario: 'acme_burst', step: 'directory' },
    });
    directoryDuration.add(res.timings.duration);
    check(res, { 'directory 200': (r) => r.status === 200 });
    if (res.status === 429) {
        recordRateLimit(res);
        return; // backoff this VU iteration
    }
    if (res.status !== 200) return;
    const dir = res.json();
    // Step 2 — new-nonce.
    if (dir.newNonce) {
        res = http.head(dir.newNonce, {
            tags: { scenario: 'acme_burst', step: 'new_nonce' },
        });
        newNonceDuration.add(res.timings.duration);
        if (res.status === 429) {
            recordRateLimit(res);
            return;
        }
        check(res, {
            'new-nonce 200': (r) => r.status === 200,
            'replay-nonce header present': (r) => !!r.headers['Replay-Nonce'],
        });
    }
    // Step 3 — ARI synthetic 4xx fast path. Phase 4 added ARI
    // (RFC 9773); this exercises the malformed-cert-id branch which
    // returns a 4xx without a DB lookup. Pinning this here means a
    // regression that turned the malformed path into a DB query
    // would surface as a p95 spike.
    if (dir.renewalInfo) {
        res = http.get(dir.renewalInfo + '/aaaa.bbbb', {
            tags: { scenario: 'acme_burst', step: 'renewal_info' },
        });
        renewalInfoDuration.add(res.timings.duration);
        if (res.status === 429) {
            recordRateLimit(res);
            return;
        }
        check(res, {
            'renewal-info 4xx for synthetic cert-id':
                (r) => r.status === 400 || r.status === 404,
        });
    }
 }
 // recordRateLimit pins the Phase 5 ACME rate-limit response shape:
 //   - HTTP 429
 //   - Content-Type: application/problem+json
 //   - Body: {"type":"urn:ietf:params:acme:error:rateLimited", ...}
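 // A regression that returned a plain-text 429 or a different problem
 // type would increment acme_rate_limited_count but NOT
 // acme_rate_limit_shape_ok, so the operator would see
 // (acme_rate_limited_count - acme_rate_limit_shape_ok) > 0 in the
 // summary. A 503 never reaches this function: callers invoke it only
 // on status 429.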
 function recordRateLimit(res) {
    rateLimitedCount.add(1);
    const ct = res.headers['Content-Type'] || '';
    if (!ct.includes('application/problem+json')) {
        return;
    }
    let body;
    try {
        body = res.json();
    } catch (e) {
        return;
    }
    if (body && typeof body.type === 'string' &&
        body.type.startsWith('urn:ietf:params:acme:error:rateLimited')) {
        rateLimitShapeOK.add(1);
    }
 }
 export function handleSummary(data) {
    return {
        '/results/summary-acme-burst.json': JSON.stringify(data, null, 2),
        '/results/summary-acme-burst.txt': textSummary(data, { indent: ' ', enableColors: false }),
        stdout: textSummary(data, { indent: ' ', enableColors: true }),
    };
 }
@@ -0,0 +1,126 @@
 // Phase 8 SCALE-H2 — agent fleet heartbeat storm.
 //
 // What this measures:
 //   5,000 agents heartbeating at 30s intervals = ~167 heartbeats/sec
 //   sustained. Each heartbeat is POST /api/v1/agents/{id}/heartbeat
 //   with optional metadata. Pre-seeded fleet provided by
 //   deploy/test/loadtest/seed/02_agent_fleet.sql.
 //
 // What this does NOT measure:
 //   - The agent work-poll path (GET /api/v1/agents/{id}/work). The
 //     heartbeat hot path is the highest-frequency call on a typical
 //     fleet (work-poll cadence is 30s default like heartbeat, but
 //     work-poll returns the empty set 99% of the time and is cheap;
 //     heartbeat does an UPDATE on every call). v2 of the harness
 //     could combine them.
 //   - The agent CSR-submit path (POST /api/v1/agents/{id}/csr). That
 //     fires on per-cert issuance, not per heartbeat, and is exercised
 //     by the existing API tier's POST /api/v1/certificates scenario.
 //   - Auth-key per-agent rotation. The loadtest stack runs with a
 //     single api-key (`load-test-token`); per-agent api-key
 //     hashing/rotation isn't a load axis.
 //
 // Why constant-arrival-rate (not constant-vus):
 //   The point is to model what 5K real agents would offer the server
 //   at their native cadence. 5K agents * (1 heartbeat / 30s) =
 //   166.67 req/s offered. constant-arrival-rate fires at exactly
 //   that rate regardless of latency; if the server backpressures,
 //   queue builds and p99 shows it. constant-vus would let slow
 //   responses block, masking the actual ceiling.
 //
 // Threshold contract:
 //   - p99 < 1s for the heartbeat POST. The handler does an UPDATE on
 //     agents.last_heartbeat_at (+ optional metadata columns) and an
 //     RBAC check. Even at 200 req/s a tight UPDATE on an indexed
 //     primary key should stay sub-second.
 //   - p95 < 500ms.
 //   - Error rate < 0.1%. The seeded agents are all status='Online'
 //     so no 410 Gone (retired-agent) responses; anything 4xx is a
 //     bug. 5xx is a server health regression.
 //
 // Phase 8 reference:
 //   - Source finding: SCALE-H2.
 //   - Pre-state: heartbeat path not load-tested. The ~10-agent demo
 //     seed in seed_demo.sql produces ~0.3 heartbeats/sec, orders of
 //     magnitude below fleet scale.
 import http from 'k6/http';
 import { check } from 'k6';
 import { textSummary } from 'https://jslib.k6.io/k6-summary/0.0.2/index.js';
 const BASE  = __ENV.CERTCTL_BASE  || 'https://certctl-server:8443';
 const TOKEN = __ENV.CERTCTL_TOKEN || 'load-test-token';
 // 5000 agents * (1 / 30s) = 166.67 heartbeats/sec. Round to 167.
 const TARGET_RATE = parseInt(__ENV.K6_AGENT_RATE || '167', 10);
 // Total agents in the fleet seed. The k6 scenario derives an agent
 // index deterministically from __VU and __ITER on each iteration to
 // spread the per-row UPDATE pressure across the table.
 const FLEET_SIZE = parseInt(__ENV.K6_AGENT_FLEET || '5000', 10);
 export const options = {
    scenarios: {
        agent_storm: {
            executor: 'constant-arrival-rate',
            rate: TARGET_RATE,
            timeUnit: '1s',
            duration: '5m',
            preAllocatedVUs: 50,
            maxVUs: 200,
            exec: 'heartbeat',
            tags: { scenario: 'agent_storm' },
        },
    },
    thresholds: {
        'http_req_duration{scenario:agent_storm}': ['p(99)<1000', 'p(95)<500'],
        'http_req_failed{scenario:agent_storm}': ['rate<0.001'],
    },
    summaryTrendStats: ['avg', 'min', 'med', 'p(95)', 'p(99)', 'max'],
    insecureSkipTLSVerify: true,
 };
 // agentID returns a deterministic agent id from the loadtest fleet
 // seed. Spreading round-robin across the fleet means the UPDATE
 // pressure hits every row equally rather than the same hot row over
 // and over.
 function agentID() {
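    // __ITER is k6's per-VU iteration counter; combined with __VU
    // (the VU index) it yields an index in 0..FLEET_SIZE-1 after the
    // modulo. Indices are not unique per call (VU offsets repeat,
    // e.g. every 5 VUs at the default fleet size of 5000), but the
    // calls still spread across the fleet.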
    const idx = (__VU * 1000 + __ITER) % FLEET_SIZE;
    return 'ag-loadtest-' + String(idx + 1).padStart(5, '0');
 }
 export function heartbeat() {
    const id = agentID();
    // Optional metadata; the heartbeat handler tolerates an empty body
    // (no metadata) but real agents send their version + hostname on
    // every call so we include them here.
    const payload = JSON.stringify({
        version: '2.1.0',
        hostname: 'loadtest-' + id.slice(-5) + '.fleet.example.test',
        os: 'linux',
        architecture: 'amd64',
    });
    const res = http.post(`${BASE}/api/v1/agents/${id}/heartbeat`, payload, {
        headers: {
            'Content-Type': 'application/json',
            'Authorization': `Bearer ${TOKEN}`,
        },
        tags: { scenario: 'agent_storm' },
    });
    check(res, {
        'heartbeat 2xx': (r) => r.status >= 200 && r.status < 300,
    });
 }
 export function handleSummary(data) {
    return {
        '/results/summary-agent-storm.json': JSON.stringify(data, null, 2),
        '/results/summary-agent-storm.txt': textSummary(data, { indent: ' ', enableColors: false }),
        stdout: textSummary(data, { indent: ' ', enableColors: true }),
    };
 }
@@ -0,0 +1,129 @@
 // Phase 8 SCALE-H2 — bulk-renewal under load.
 //
 // What this measures:
 //   POST /api/v1/certificates/bulk-renew throughput against a
 //   10K-cert pre-seeded fleet. Each iteration POSTs a criteria-mode
 //   bulk-renew request scoped by team_id + issuer_id so the server
 //   enqueues N renewal jobs and returns a
 //   per-cert {certificate_id, job_id} envelope.
 //
 // Why criteria-mode (not certificate-ids mode):
 //   Criteria-mode lets the scenario re-fire without maintaining a
 //   moving list of cert IDs. The seeded fleet also carries a stable
 //   `tags.batch = 'bulk-renewal'` marker, but BulkRenewalCriteria
 //   has no tag filter (see the NOTE in bulkRenewal below), so the
 //   request scopes by team_id + issuer_id instead. Against a server
 //   that lacks those seed IDs the criteria simply matches
 //   nothing.
 //
 // What this does NOT measure:
 //   - The scheduler's renewal scan itself. The bulk-renew handler
 //     enqueues issuance jobs synchronously into the `jobs` table;
 //     the scheduler's `jobProcessorLoop` picks them up on its next
 //     tick. The DB write throughput is what's measured here; the
 //     job-execution path is bounded by per-issuer concurrency
 //     (CERTCTL_RENEWAL_CONCURRENCY=25 default) and isn't usefully
 //     amplified by adding more inbound bulk-renew calls.
 //   - Full POST → poll deployments → cert-served loop. Same v1/v2
 //     deferral as the connector-tier scenarios — needs the agent
 //     poll surface plumbed end-to-end.
 //
 // Threshold contract:
 //   - p99 < 5s, p95 < 2s for the bulk-renew POST. Each call walks
 //     the criteria, materializes the matching managed_certificates
 //     rows, inserts N rows into `jobs`, and returns the envelope.
 //   - Error rate < 1%. Anything 4xx/5xx counts.
 //
 // Phase 8 reference:
 //   - Source finding: SCALE-H2.
 //   - Pre-state: only the API tier (50 req/s POST /certificates +
 //     GET /certificates) and connector tier (per-target handshake)
 //     were measured. The bulk-renew hot path was uncovered.
 //   - Seed: deploy/test/loadtest/seed/01_bulk_renewal_certs.sql
 //     creates 10K rows with tags.batch='bulk-renewal'. The seed
 //     must run before this scenario; the scale-seed compose
 //     service (profile `scale`) gates this.
 import http from 'k6/http';
 import { check } from 'k6';
 import { textSummary } from 'https://jslib.k6.io/k6-summary/0.0.2/index.js';
 const BASE  = __ENV.CERTCTL_BASE  || 'https://localhost:8443';
 const TOKEN = __ENV.CERTCTL_TOKEN || 'load-test-token';
 // Sustained throughput target. constant-arrival-rate at 5 req/s for 5
 // minutes = 1500 bulk-renew POSTs. Each POST touches up to 10K
 // managed_certificates rows (criteria scan) + inserts up to 10K
 // rows into `jobs`, so the offered load is lower than the API
 // tier's 50 req/s in raw requests per second but the per-call
 // cost is much larger.
 //
 // 5 req/s was picked deliberately:
 //   - 50 req/s combined with the API tier's 50 saturates the demo-
 //     scale compose's DB pool (CERTCTL_DATABASE_MAX_CONNS=50). The
 //     Phase 8 scenario should measure the per-call ceiling without
 //     fighting the pool.
 //   - Each call enqueues thousands of jobs; the scheduler's
 //     jobProcessorLoop has finite per-tick budget. Pushing higher
 //     than 5 req/s would queue work faster than the scheduler
 //     drains it, which produces a transient backlog metric (worth
 //     measuring eventually) but isn't what SCALE-H2 asks for.
 export const options = {
    scenarios: {
        bulk_renewal: {
            executor: 'constant-arrival-rate',
            rate: 5,
            timeUnit: '1s',
            duration: '5m',
            preAllocatedVUs: 10,
            maxVUs: 30,
            exec: 'bulkRenewal',
            tags: { scenario: 'bulk_renewal' },
        },
    },
    thresholds: {
        // Single-scenario threshold, looser than the API tier
        // because each call is heavier (DB scan + N inserts).
        'http_req_duration{scenario:bulk_renewal}': ['p(99)<5000', 'p(95)<2000'],
        'http_req_failed{scenario:bulk_renewal}': ['rate<0.01'],
    },
    summaryTrendStats: ['avg', 'min', 'med', 'p(95)', 'p(99)', 'max'],
    insecureSkipTLSVerify: true,
 };
 export function bulkRenewal() {
    // Scope by team_id — the seed binds every loadtest cert to
    // t-platform; in a production-multi-tenant deploy, team scoping
    // is the typical bulk-renew shape. This exercises the criteria
    // walker AND the team-scoped permission check in the handler.
    //
    // NOTE: this does NOT include `tags` because the BulkRenewalCriteria
    // domain type (handler/bulk_renewal.go) only exposes profile_id,
    // owner_id, agent_id, issuer_id, team_id, certificate_ids — not
    // tag-based filtering. The team_id scope plus the production-
    // separated FK guarantees we only touch the Phase 8 seed.
    const payload = JSON.stringify({
        team_id: 't-platform',
        issuer_id: 'iss-local',
    });
    const res = http.post(`${BASE}/api/v1/certificates/bulk-renew`, payload, {
        headers: {
            'Content-Type': 'application/json',
            'Authorization': `Bearer ${TOKEN}`,
        },
        tags: { scenario: 'bulk_renewal' },
    });
    check(res, {
        'bulk-renew 2xx': (r) => r.status >= 200 && r.status < 300,
    });
 }
 export function handleSummary(data) {
    return {
        '/results/summary-bulk-renewal.json': JSON.stringify(data, null, 2),
        '/results/summary-bulk-renewal.txt': textSummary(data, { indent: ' ', enableColors: false }),
        stdout: textSummary(data, { indent: ' ', enableColors: true }),
    };
 }
@@ -0,0 +1,85 @@
 -- Phase 8 SCALE-H2: bulk-renewal scenario seed.
 --
 -- Generates 10,000 managed_certificates rows linked to the existing
 -- seed_demo.sql FKs (iss-local, o-alice, t-platform, rp-standard) so
 -- the bulk-renewal k6 scenario can POST /api/v1/certificates/bulk-renew
 -- against a fleet-scale dataset instead of the 15-row demo seed.
 --
 -- Behavior:
 --   - Idempotent. ON CONFLICT (name) DO NOTHING — re-running the seed
 --     against an already-seeded DB is a no-op.
 --   - expires_at is uniformly distributed across the next 30 days so
 --     a renewal_window_days = 30 policy considers every row eligible.
 --   - status = 'active' so the renewal selector treats them as
 --     live (the scheduler skips status IN ('pending', 'failed',
 --     'revoked', 'retired')).
 --   - name is generated as 'loadtest-bulk-NNNNN.example.test' for a
 --     stable, predictable identifier. The confirmation count at the
 --     end of this script matches on the prefix (the production fleet
 --     wouldn't share it).
 --
 -- Volume target: 10,000 rows. Insert wall time on the loadtest stack
 -- (postgres:16-alpine, 2 CPU / 4 GiB): typically < 5 seconds via the
 -- single-statement generate_series + INSERT pattern below. The
 -- compose seed-init container runs this BEFORE the k6 driver starts,
 -- so the steady-state load measurement isn't affected by seed time.
 --
 -- Why not generated in Go via a fixtures helper:
 --   - The certctl-server boots from a clean DB and runs migrations +
 --     seed_demo.sql automatically when CERTCTL_DEMO_SEED=true. Adding
 --     a Go-side fixtures helper would require either (a) a new
 --     CERTCTL_LOADTEST_SEED flag wired into cmd/server/main.go (cross-
 --     cutting change for one test path) or (b) a separate seed binary
 --     (more compose surface). Raw SQL is the smallest viable change.
 --
 -- Phase 8 entry point: runs only when the loadtest compose stack is
 -- started with `--profile scale`, which enables the scale-seed service.
 INSERT INTO managed_certificates (
    id,
    name,
    common_name,
    sans,
    environment,
    owner_id,
    team_id,
    issuer_id,
    renewal_policy_id,
    status,
    expires_at,
    tags,
    created_at,
    updated_at
 )
 SELECT
    'cert-loadtest-bulk-' || lpad(g::text, 5, '0'),
    'loadtest-bulk-' || lpad(g::text, 5, '0') || '.example.test',
    'loadtest-bulk-' || lpad(g::text, 5, '0') || '.example.test',
    ARRAY['loadtest-bulk-' || lpad(g::text, 5, '0') || '.example.test'],
    'loadtest',
    'o-alice',
    't-platform',
    'iss-local',
    'rp-standard',
    'active',
    -- Distribute expires_at uniformly across the next 30 days so a
    -- 30-day-window renewal policy sees every row as eligible.
    NOW() + ((g % 30) || ' days')::interval + ((g % 24) || ' hours')::interval,
    jsonb_build_object('source', 'loadtest-phase8', 'batch', 'bulk-renewal'),
    NOW(),
    NOW()
 FROM generate_series(1, 10000) AS g
 ON CONFLICT (name) DO NOTHING;
 -- Confirmation row count: the operator (or CI) can grep the seed
 -- logs for this NOTICE to verify the fleet shape post-insert. The
 -- output appears in `docker compose logs scale-seed` after the run.
 DO $$
 DECLARE
    cert_count integer;
 BEGIN
    SELECT COUNT(*) INTO cert_count
    FROM managed_certificates
    WHERE name LIKE 'loadtest-bulk-%';
    RAISE NOTICE 'Phase 8 bulk-renewal seed: % managed_certificates rows present', cert_count;
 END $$;
@@ -0,0 +1,85 @@
 -- Phase 8 SCALE-H2: agent-fleet heartbeat-storm scenario seed.
 --
 -- Generates 5,000 agents rows so the heartbeat-storm k6 scenario can
 -- model a fleet-scale heartbeat pattern (5K agents heartbeating at the
 -- native 30s cadence = ~167 heartbeats/sec sustained) instead of the
 -- ~10-agent demo seed.
 --
 -- Behavior:
 --   - Idempotent. ON CONFLICT (id) DO NOTHING — re-runnable against an
 --     already-seeded DB.
 --   - name is unique (a UNIQUE constraint in migration 000001) so the
 --     name suffix mirrors the id suffix.
 --   - status = 'Online' so the heartbeat handler's retire-check
 --     (service.ErrAgentRetired) doesn't 410 the storm.
 --   - last_heartbeat_at staggered across the prior 60 seconds so the
 --     stale-agent reaper (agentHealthCheckLoop) doesn't immediately
 --     flip half the fleet to 'Offline' during the first scheduler
 --     tick of the load run.
 --   - api_key_hash = 'loadtest_no_auth'. The loadtest compose runs
 --     CERTCTL_AUTH_TYPE=api-key with a single static token
 --     (load-test-token), which bypasses per-agent key check the same
 --     way the existing API tier scenarios do. Production deploys with
 --     CERTCTL_AUTH_TYPE=agent-key per-agent would seed real bcrypt'd
 --     hashes; this column is opaque to the load-test path.
 --   - registered_at = NOW() - a deterministic 1-90 day offset
 --     (g % 90 + 1) so agent ages vary and any age-based query plans
 --     are exercised.
 --
 -- Volume target: 5,000 rows. The agents schema is much narrower than
 -- managed_certificates so the insert is sub-second on the loadtest
 -- stack. The 5K agents do not own any deployment_targets in this
 -- fixture (the scenario only measures the heartbeat hot path, not
 -- the work-poll path which depends on cert + target wiring).
 --
 -- Phase 8 entry point: runs only when the loadtest compose stack is
 -- started with `--profile scale`, which enables the scale-seed service.
 INSERT INTO agents (
    id,
    name,
    hostname,
    status,
    last_heartbeat_at,
    registered_at,
    api_key_hash,
    os,
    architecture,
    ip_address,
    version
 )
 SELECT
    'ag-loadtest-' || lpad(g::text, 5, '0'),
    'loadtest-agent-' || lpad(g::text, 5, '0'),
    'loadtest-' || lpad(g::text, 5, '0') || '.fleet.example.test',
    'Online',
    -- Stagger last_heartbeat_at across the prior 60 seconds (= 2x the
    -- agent's native poll interval) so the first wave of incoming
    -- heartbeats doesn't all arrive in lockstep at t=0.
    NOW() - ((g % 60) || ' seconds')::interval,
    -- registered_at spread deterministically 1-90 days back.
    NOW() - ((g % 90 + 1) || ' days')::interval,
    'loadtest_no_auth',
    -- Mix linux/windows/darwin so the OS distribution column in the
    -- agents page isn't pure-linux during the storm.
    CASE (g % 10)
        WHEN 0 THEN 'windows'
        WHEN 1 THEN 'darwin'
        ELSE 'linux'
    END,
    -- amd64 dominates; arm64 minority.
    CASE WHEN (g % 5) = 0 THEN 'arm64' ELSE 'amd64' END,
    -- IPv4 in the 10.42.0.0/16 fleet range, deterministic per id.
    '10.42.' || ((g / 256) % 256)::text || '.' || (g % 256)::text,
    '2.1.0'
 FROM generate_series(1, 5000) AS g
 ON CONFLICT (id) DO NOTHING;
 DO $$
 DECLARE
    agent_count integer;
 BEGIN
    SELECT COUNT(*) INTO agent_count
    FROM agents
    WHERE id LIKE 'ag-loadtest-%';
    RAISE NOTICE 'Phase 8 agent-storm seed: % agents rows present', agent_count;
 END $$;
@@ -0,0 +1,87 @@
 # Phase 8 load-test seed fixtures
 Opt-in seed scripts that grow the loadtest DB from the demo-scale
 fixture (~15 certs / ~10 agents from `migrations/seed_demo.sql`) to
 fleet scale (10K certs + 5K agents) so the Phase 8 SCALE-H2 scenarios
 measure something representative.
 ## When these run
 The default `make loadtest` path does NOT touch this directory — the
 API tier and connector tier scenarios run against the demo seed alone
 and complete in ~5 minutes. The Phase 8 scenarios opt in via the
 `scale` compose profile (`docker compose --profile scale ...`); when
 it is enabled, the
 `certctl-loadtest-scale-seed` one-shot init container runs every
 `*.sql` file in this directory in lexical order against the same
 Postgres instance the server uses.
 Compose service wiring (see `../docker-compose.yml`):
 - Service: `scale-seed`
 - Profile: `scale` (compose `profiles:` gate; not started by
  default)
 - Depends on: `postgres` (service_healthy) AND `certctl-server`
  (service_healthy — server runs schema migrations at boot so the
  seed runs AFTER tables exist)
 - Order: lexical (`01_bulk_renewal_certs.sql` then
  `02_agent_fleet.sql`)
 - Idempotent: every script uses `ON CONFLICT DO NOTHING` so re-running
  is a no-op.
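
The lexical ordering noted above is why the seed files carry zero-padded numeric prefixes. The following is a minimal JavaScript sketch of that ordering, not part of the harness: `seedOrder` is a hypothetical helper, and it assumes plain code-unit string comparison behaves like the shell glob does for these ASCII filenames.

```javascript
// Hypothetical helper: return the .sql files in the order a lexical
// sort would run them. Array.prototype.sort() with no comparator
// compares strings by UTF-16 code units.
function seedOrder(files) {
  return files.filter((f) => f.endsWith('.sql')).slice().sort();
}

// The two Phase 8 seeds sort as intended thanks to the 01_/02_ prefixes.
const current = seedOrder([
  '02_agent_fleet.sql',
  'README.md',
  '01_bulk_renewal_certs.sql',
]);

// Without zero padding, "10_" would sort BEFORE "2_".
const unpadded = seedOrder(['2_second.sql', '10_tenth.sql']);

console.log(current);
console.log(unpadded);
```

If a later fixture is added, keeping the two-digit prefix (`03_`, ..., `10_`) preserves the intended run order.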
 ## What gets seeded
 | File | Rows | Purpose |
 |---|---|---|
 | `01_bulk_renewal_certs.sql` | 10,000 managed_certificates | Fleet shape for `bulk_renewal.js`. All linked to demo FKs (iss-local, o-alice, t-platform, rp-standard). Status `active`, expires_at distributed across the next 30 days so a 30-day renewal window considers every row eligible. Name prefix `loadtest-bulk-` marks the seeded rows (the seed's confirmation count matches on it); `bulk_renewal.js` itself scopes its criteria by `team_id` + `issuer_id`. |
 | `02_agent_fleet.sql` | 5,000 agents | Fleet shape for `agent_storm.js`. Status `Online`, last_heartbeat_at staggered across prior 60s, name prefix `loadtest-agent-`. OS distribution: 80% linux / 10% windows / 10% darwin. Arch: 80% amd64 / 20% arm64. |
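
The `02_agent_fleet.sql` percentages in the table can be checked by replaying the seed's modulo rules outside Postgres. This is a minimal JavaScript sketch under that assumption; `agentShape` and `fleetDistribution` are hypothetical names that exist only for illustration, and the rules mirror the `g % 10` and `g % 5` CASE expressions over `generate_series(1, 5000)`.

```javascript
// Hypothetical helper: mirrors the CASE expressions in 02_agent_fleet.sql.
function agentShape(g) {
  const os = g % 10 === 0 ? 'windows' : g % 10 === 1 ? 'darwin' : 'linux';
  const arch = g % 5 === 0 ? 'arm64' : 'amd64';
  // Same zero-padded id format the seed and agent_storm.js use.
  const id = 'ag-loadtest-' + String(g).padStart(5, '0');
  return { id, os, arch };
}

// Tally OS and architecture counts for g = 1..fleetSize.
function fleetDistribution(fleetSize) {
  const counts = { os: {}, arch: {} };
  for (let g = 1; g <= fleetSize; g++) {
    const a = agentShape(g);
    counts.os[a.os] = (counts.os[a.os] || 0) + 1;
    counts.arch[a.arch] = (counts.arch[a.arch] || 0) + 1;
  }
  return counts;
}

const dist = fleetDistribution(5000);
console.log(dist);
```

For a 5,000-row fleet the tallies come out to 4,000 linux / 500 windows / 500 darwin and 4,000 amd64 / 1,000 arm64, matching the 80/10/10 and 80/20 splits stated in the table.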
 ## How to run the Phase 8 scenarios locally
 ```bash
 cd deploy/test/loadtest
 docker compose --profile scale up --build \
    --abort-on-container-exit --exit-code-from k6-scale-bulk \
    scale-seed k6-scale-bulk
 ```
 Or via the dedicated Makefile target (preferred for CI parity):
 ```bash
 make loadtest-scale
 ```
 ## Why SQL fixtures instead of a Go seed binary
 - The certctl-server already boots from a clean DB and runs migrations
  + `seed_demo.sql` when `CERTCTL_DEMO_SEED=true`. Adding a third seed
  mode (loadtest-scale) would mean either a new
  `CERTCTL_LOADTEST_SEED` flag wired into `cmd/server/main.go` (cross-
  cutting change for one test path) or a separate seed binary (more
  compose surface).
 - Raw SQL is the smallest viable change: each script is a single
  multi-row `INSERT … SELECT FROM generate_series(…)` plus a
  `DO $$ … RAISE NOTICE` confirmation block.
 - Idempotency is straightforward via `ON CONFLICT … DO NOTHING` — the
  same pattern `seed_demo.sql` uses.
 ## Why these volumes specifically
 - **10K certs.** The SCALE-H2 audit asked for "10K certs with
  renewal_at < now." Round number, fits in postgres:16-alpine on a
  CI runner without OOM, and large enough that the renewal selector's
  query plan is exercised (the demo's 15 rows would index-scan
  trivially).
 - **5K agents.** Heartbeat at 30s cadence = ~167 heartbeats/sec
  sustained. That's well above the 50 req/s the existing API tier
  measures and stresses the agent.heartbeat handler's per-call cost
  (last_heartbeat_at UPDATE + the RBAC permission check + the
  audit-log row).
 If a future scenario needs more rows (50K certs / 10K agents), add a
 new `03_…sql` here and another scenario file. Don't grow the existing
 files — re-running existing scenarios against a different fixture
 shape would invalidate the captured baseline.
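
The heartbeat figure above follows from simple arithmetic, sketched here in JavaScript. The helper names `offeredRate` and `totalRequests` are hypothetical, and the 5-minute duration is taken from the `agent_storm.js` scenario settings; treat this as a back-of-envelope check, not harness code.

```javascript
// Offered request rate when every agent heartbeats once per interval.
function offeredRate(fleetSize, intervalSeconds) {
  return fleetSize / intervalSeconds;
}

// Total requests a constant-arrival-rate run would issue.
function totalRequests(ratePerSecond, durationSeconds) {
  return ratePerSecond * durationSeconds;
}

const rate = offeredRate(5000, 30);           // about 166.67 req/s
const rounded = Math.round(rate);             // rounded up to a whole rate
const total = totalRequests(rounded, 5 * 60); // requests over a 5-minute run

console.log({ rate, rounded, total });
```

Scaling to a hypothetical 10K-agent fixture at the same cadence would double the offered rate, which is one reason a new fixture should get its own scenario and baseline.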
 ## Phase 8 audit reference
 Source finding: SCALE-H2 in
 `cowork/certctl-architecture-diligence-audit.html`.
 Phase 8 closure commit: see `git log --grep='Phase 8'`.
@@ -55,6 +55,29 @@ This is the load-bearing two-person-integrity contract. Pinned by:
 - `internal/service/approval_test.go::TestApproval_Approve_RejectsSameActor` — service-level pin.
 - `internal/api/handler/approval_test.go::TestApproval_HandlerApproveAsSameActor_Returns403` — handler-level pin (HTTP 403 + body contains "two-person integrity").
 ## Enforcement invariants (COMP-006 closure)
 Acquisition-audit COMP-006 closure (Sprint 7 ACQ, 2026-05-16). The audit flagged COMP-006 as UNKNOWN because it couldn't independently verify that the approval workflow was airtight: that a denied approval always results in NO certificate being signed, and that an approved approval always lets the issuance proceed. This subsection documents the enforcement chain end-to-end and names the tests that pin each layer.
 **Layer 1 — Issuance gate.** `internal/service/certificate.go::CertificateService.Create` (around L341-373) reads `CertificateProfile.RequiresApproval`. When true, the created Job is stamped `JobStatusAwaitingApproval` (not `Pending`), AND a parallel `ApprovalRequest` row is created. The job processor never touches `AwaitingApproval` rows.
 **Layer 2 — Approval state machine.** `internal/service/approval.go::ApprovalService.Reject` and `Approve` flip the approval row + the job row atomically:
 - `Reject` → approval=`Rejected`, job=`Cancelled` (pinned by `internal/service/approval_test.go::TestApproval_Reject_TransitionsJobFromAwaitingApprovalToCancelled`).
 - `Approve` → approval=`Approved`, job=`Pending` (pinned by `TestApproval_Approve_TransitionsJobFromAwaitingApprovalToPending`).
 The "already terminal" guard (`TestApproval_Approve_RejectsAlreadyDecided`) prevents a rejected approval from later being flipped to approved.
 **Layer 3 — Job claim filter (the LOAD-BEARING SQL invariant).** `internal/repository/postgres/job.go::JobRepository.ClaimPendingJobs` (around L296-310) issues:
 ```sql
 SELECT ... FROM jobs WHERE status = $1
 ```
 with `$1 = JobStatusPending`. Cancelled jobs are therefore **never** returned to `ProcessPendingJobs`, so the certificate-issuance call path (the only path that signs certs) is unreachable for a denied approval. This SQL filter is the load-bearing "no cert if denied" enforcement — Layer 2 transitions the job to `Cancelled`, Layer 3 ensures `Cancelled` jobs are inert.
 **Composition pin.** `internal/service/approval_test.go::TestApproval_COMP006_DenyChainPinsNoCertIfRejected` and `TestApproval_COMP006_ApproveChainPinsJobReachesPending` re-attest the Layer-2-to-Layer-3 handoff in a single named test pair for future auditors. A refactor that, e.g., silently transitioned a denied approval's job to `Pending` instead of `Cancelled` would trip these tests before shipping.
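The Layer 2 to Layer 3 handoff can be sketched as a small in-memory model. This is a minimal illustration only: the types, the `ApprovalOpen` status name, and `claimPending` are hypothetical stand-ins, not the actual `ApprovalService` or `JobRepository` code.

```go
package main

import (
	"errors"
	"fmt"
)

type JobStatus string
type ApprovalStatus string

const (
	JobAwaitingApproval JobStatus = "AwaitingApproval"
	JobPending          JobStatus = "Pending"
	JobCancelled        JobStatus = "Cancelled"

	ApprovalOpen     ApprovalStatus = "Open" // hypothetical name for "not yet decided"
	ApprovalApproved ApprovalStatus = "Approved"
	ApprovalRejected ApprovalStatus = "Rejected"
)

type Job struct {
	ID     string
	Status JobStatus
}

type ApprovalRequest struct {
	Status ApprovalStatus
	Job    *Job
}

var ErrAlreadyDecided = errors.New("approval already decided")

// decide models Layer 2: flip the approval and its job together, and
// refuse to touch an approval that is already terminal.
func (a *ApprovalRequest) decide(approval ApprovalStatus, job JobStatus) error {
	if a.Status != ApprovalOpen {
		return ErrAlreadyDecided
	}
	a.Status, a.Job.Status = approval, job
	return nil
}

func (a *ApprovalRequest) Approve() error { return a.decide(ApprovalApproved, JobPending) }
func (a *ApprovalRequest) Reject() error  { return a.decide(ApprovalRejected, JobCancelled) }

// claimPending models Layer 3: only Pending jobs ever reach the processor.
func claimPending(jobs []*Job) []*Job {
	var out []*Job
	for _, j := range jobs {
		if j.Status == JobPending {
			out = append(out, j)
		}
	}
	return out
}

func main() {
	job := &Job{ID: "j-1", Status: JobAwaitingApproval}
	req := &ApprovalRequest{Status: ApprovalOpen, Job: job}
	_ = req.Reject()
	fmt.Println(len(claimPending([]*Job{job})), req.Approve())
}
```

In this model a rejected request leaves nothing for the claim filter to return, and a later `Approve` fails on the terminal guard, which is the same pair of properties the composition tests are described as pinning.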
 ## Operator playbook: "I need to approve a renewal"
 ```bash
@@ -0,0 +1,161 @@
 # Audit-trail tamper-evidence (audit_events hash chain)
 > Last reviewed: 2026-05-16
 Sprint 6 COMP-001-HASH closure. The `audit_events` table has two
 layered defenses against history rewrites:
 | Layer | Migration | What it blocks |
 |---|---|---|
 | **WORM trigger** | `000018_audit_events_worm.up.sql` | The application role cannot `UPDATE` or `DELETE` rows (tamper-**prevention**). |
 | **Hash chain** | `000047_audit_events_hash_chain.up.sql` | A compliance superuser (DB-superuser-equivalent) who bypasses the WORM trigger CAN still rewrite rows, but the rewrite is **detectable** — every subsequent `audit_events_verify_chain()` walk reports the first broken row's id + position (tamper-**evidence**). |
 This document covers the hash-chain layer. The WORM layer is
 documented inline in `migrations/000018_audit_events_worm.up.sql`.
 ## Why a hash chain in addition to WORM
 The WORM trigger documents (in its header comment) that a compliance
 superuser role exists by design — backup-restore, retention purges,
 and breach-recovery operators need a way through. Without a hash
 chain, that role can rewrite any row's `actor` / `action` / `details`
 content with no on-disk trace.
 HIPAA §164.312(b), FedRAMP AU-9, and NIST 800-53 AU-10 want
 tamper-**evidence**, not just tamper-prevention. The hash chain
 provides it: every row carries a `row_hash = sha256(prev_hash || id
 || actor || actor_type || action || resource_type || resource_id
 || details::text || timestamp_iso8601_utc || event_category)`, and
 the genesis row's `prev_hash` is `NULL`. Mutating any field in any
 row breaks the chain at that row's position; the verifier returns
 the first break.
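To make the chaining concrete, here is a minimal Go sketch of the same idea. It hashes only a subset of the columns, and the `0x1f` field separator, the `Row` type, and the helper names are assumptions of the sketch; the authoritative encoding is whatever migration 000047 implements.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Row carries a subset of the audit_events columns; the remaining hashed
// columns would be fed into hashRow the same way.
type Row struct {
	ID, Actor, Action, Details string
	PrevHash                   string // "" stands in for the genesis row's NULL
	RowHash                    string
}

// hashRow computes sha256(prev_hash || fields...). The 0x1f separator is an
// assumption of this sketch, not the migration's actual encoding.
func hashRow(prev string, r Row) string {
	h := sha256.New()
	for _, f := range []string{prev, r.ID, r.Actor, r.Action, r.Details} {
		h.Write([]byte(f))
		h.Write([]byte{0x1f})
	}
	return hex.EncodeToString(h.Sum(nil))
}

// appendRow links a new row to the current tail of the chain.
func appendRow(chain []Row, r Row) []Row {
	if n := len(chain); n > 0 {
		r.PrevHash = chain[n-1].RowHash
	}
	r.RowHash = hashRow(r.PrevHash, r)
	return append(chain, r)
}

// firstBreak walks the chain in order and returns the 0-indexed position of
// the first row whose link or hash does not verify, or -1 if all rows do.
func firstBreak(chain []Row) int {
	prev := ""
	for i, r := range chain {
		if r.PrevHash != prev || hashRow(prev, r) != r.RowHash {
			return i
		}
		prev = r.RowHash
	}
	return -1
}

// demoChain builds a three-row chain for the example.
func demoChain() []Row {
	var c []Row
	for i, action := range []string{"cert.create", "cert.renew", "cert.revoke"} {
		c = appendRow(c, Row{ID: fmt.Sprintf("ae-%d", i), Actor: "u-1", Action: action, Details: "{}"})
	}
	return c
}

func main() {
	c := demoChain()
	fmt.Println(firstBreak(c)) // -1: chain intact
	c[1].Actor = "u-evil"      // simulate a superuser rewrite of one field
	fmt.Println(firstBreak(c)) // 1: first broken position
}
```

Because the stored `RowHash` no longer matches the recomputed value for the edited row, the walk stops there. Rewriting the row's hash as well would only move the break to the next row's `PrevHash` link.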
 ## The verifier function
 `audit_events_verify_chain()` is a STABLE plpgsql function shipped
 in migration 000047. It walks every row in `(timestamp ASC, id ASC)`
 order, recomputes each row's expected hash, and returns:
 ```
 first_break_id  TEXT  -- NULL if the chain validated end-to-end
 first_break_pos INT   -- 0-indexed position of the first break
 row_count       INT   -- rows walked (= position + 1 on break, else table size)
 ```
 Call it directly from psql:
 ```sql
 SELECT first_break_id, first_break_pos, row_count FROM audit_events_verify_chain();
 ```
 ## Scheduled verification + Prometheus exposure
 The scheduler's `auditChainVerifyLoop` calls the verifier every
 `CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL` (default 6h) and writes the
 results into the `AuditChainCounter` instance shared with the
 metrics handler. Four metrics get exposed at
 `/api/v1/metrics/prometheus`:
 | Metric | Type | Meaning |
 |---|---|---|
 | `certctl_audit_chain_break_detected_total` | counter | Sticky once non-zero — the actionable alarm. |
 | `certctl_audit_chain_verify_total` | counter | Walks completed. Cross-check that the loop is alive. |
 | `certctl_audit_chain_rows` | gauge | Most recent walk's row count. |
 | `certctl_audit_chain_last_verified_at` | gauge | Unix seconds of most recent walk (0 = never). |
 The recommended alert rule is:
 ```yaml
 # Prometheus 2.x rule-file format; the group name is illustrative.
 groups:
   - name: certctl-audit-chain
     rules:
       - alert: AuditChainBreak
         expr: certctl_audit_chain_break_detected_total > 0
         for: 1m
         labels:
           severity: page
           category: compliance
         annotations:
           summary: "audit_events hash chain break detected; investigate immediately"
           runbook: "<your-runbook-url>/audit-chain-break"
 ```
 Cross-check `certctl_audit_chain_last_verified_at` (should advance
 roughly every `CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL`) and
 `certctl_audit_chain_verify_total` (should increment monotonically).
 A stalled `_verified_at` with an unchanged `_verify_total` means the
 scheduler loop has died — page on that too.
 ## Performance notes
 The walk is `O(N)` plpgsql over the `audit_events` table. On
 testcontainers + postgres:16-alpine the cost scales linearly:
 | Row count | Walk duration (approx) |
 |---|---|
 | 10k | < 50 ms |
 | 100k | < 500 ms |
 | 1M | 2-3 s |
 | 10M | 25-30 s |
 A 5-minute per-tick context timeout (in
 `internal/scheduler/scheduler.go::runAuditChainVerify`) bounds the
 worst case. Fleets with > 10M audit rows should consider:
 1. Lengthening `CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL` to 24h.
 2. Pre-aggregating older rows (out of scope today — would require a
   "chain checkpoint" concept that re-anchors the genesis hash to a
   snapshot's row_hash; future work if needed).
 ## What to do when a break is detected
 1. **Don't panic, don't auto-remediate.** The break is a forensic
   signal, not a self-healing event.
 2. **Capture the position + id.** The metric exposes both, but the
   sticky in-memory state (`AuditChainCounter.BrokenAtID`) records
   only the first break. Run the verifier yourself to confirm them;
   it also stops at the first break, so downstream breaks are
   enumerated by the restore-and-diff in step 5:
   ```sql
   SELECT first_break_id, first_break_pos, row_count FROM audit_events_verify_chain();
   ```
 3. **Snapshot the table.** `pg_dump --table=audit_events --data-only`
   to a chain-of-custody location. The next investigative step is
   recovering the original row content from the most recent backup
   that pre-dates the tampering — without this snapshot you can't
   tell which write order caused the divergence.
 4. **Audit the compliance-superuser credential trail.** The break
   implies someone with non-app DB credentials wrote to
   `audit_events`. Rotate the credential, investigate every recent
   session that authenticated under it, and review the WAL for the
   write.
 5. **Restore + cross-reference.** If you keep streaming WAL or
   periodic snapshots, restore a known-good snapshot to a sandbox
   and `EXCEPT`-diff the two `audit_events` tables to enumerate
   every mutated row.
 ## Backfill behavior
 Migration 000047 backfills existing `audit_events` rows in
 `(timestamp ASC, id ASC)` order during its transaction. The WORM
 trigger is temporarily `DISABLE`d for the duration; subsequent
 `ENABLE` is a no-op equivalent. The migration is idempotent — a
 re-run sees `row_hash IS NULL` rows as the only backfill targets, so
 already-hashed rows are not touched.
 Once backfill completes, `row_hash` becomes `NOT NULL`. `prev_hash`
 remains nullable so the genesis row (first row in the chain) stays
 representable.
 ## Operator configuration
 | Env var | Default | Notes |
 |---|---|---|
 | `CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL` | `6h` | Tick cadence for the scheduler's verify loop. Zero or negative is ignored. |
 ## See also
 - `migrations/000047_audit_events_hash_chain.up.sql` — migration source.
 - `migrations/000018_audit_events_worm.up.sql` — paired WORM trigger.
 - `internal/repository/postgres/audit_chain_test.go` — testcontainers integration tests.
 - `internal/repository/postgres/audit_worm_test.go` — WORM behaviour tests.
 - `internal/scheduler/scheduler.go::auditChainVerifyLoop` — scheduler loop.
 - `internal/service/audit_chain_metric.go` — `AuditChainCounter`.
 - `internal/api/handler/metrics.go` — Prometheus exposer.
@@ -300,6 +300,64 @@ constant, router-level no-rbacGate-wraps-protocol-paths).
  attacks where an attacker captures a logout JWT and replays it.
 - **Cache-Control: no-store** on the response per spec §2.5.
 ### Userinfo + BCL SSRF parity (post-SEC-001 follow-up)
 The original SEC-001 closure (Sprint 1, 2026-05-16) routed two OIDC
 discovery legs — `test_discovery.go` dry-run and `service.go` runtime
 provider load — through `validation.SafeHTTPDialContext` via the
 `SafeOIDCContext(ctx)` helper at
 [`internal/auth/oidc/safehttp.go`](../../internal/auth/oidc/safehttp.go).
 The acquisition-audit follow-up (2026-05-16) flagged two adjacent
 call sites the sweep missed; both are now wrapped identically.
 - **SEC-020 — Userinfo fallback (`fetchUserinfoGroups`).**
  `internal/auth/oidc/service.go` previously called
  `entry.provider.UserInfo(ctx, ts)` with the bare request context
  on the userinfo-fallback leg (operator opt-in when an IdP doesn't
  surface groups in the ID token). go-oidc/v3's `Provider.UserInfo`
  derives its `http.Client` from `ctx` via `getClient(ctx)`
  (`oidc.go:61-65`); without an override the internal `doRequest`
  falls through to `http.DefaultClient` — no SSRF guard, no DNS-
  rebinding re-resolve at dial time. An IdP whose discovery doc
  advertises a `userinfo_endpoint` pointing at a reserved address
  (loopback, link-local, cloud-metadata `169.254.169.254`) would
  trigger an unguarded egress at userinfo-fetch time. Fixed by
  wrapping `ctx` via `SafeOIDCContext(ctx)` before both
  `oauthConfig.TokenSource` and `provider.UserInfo`. Pinned by
  `TestFetchUserinfoGroups_SSRF_BlocksReservedAddress`.
 - **SEC-021 — Back-channel logout discovery re-fetch.**
  `internal/api/handler/auth_session_oidc_bcl.go::Verify` performs
  a per-request `gooidc.NewProvider(ctx, matched.IssuerURL)` to
  fetch the JWKS for verifying the BCL token's signature. Same
  bare-ctx shape — an IdP whose registered `IssuerURL` resolves to
  a reserved address (or that is rebinding to one at logout time)
  would dial an unguarded HTTPS egress. Fixed by wrapping via
  `oidcsvc.SafeOIDCContext(ctx)` before `NewProvider`. Pinned by
  `TestDefaultBCLVerifier_SSRF_BlocksReservedAddress`.
 - **Context-key shape (why a single wrap covers both legs).**
  `gooidc.ClientContext` is implemented as
  `context.WithValue(ctx, oauth2.HTTPClient, client)` (go-oidc
  v3.18.0 `oidc.go:57-59`). Both go-oidc's `getClient` AND
  `golang.org/x/oauth2`'s `internal.ContextClient` read the same
  `oauth2.HTTPClient` key. So the single `SafeOIDCContext` wrap
  covers go-oidc-driven HTTP (Provider.UserInfo, NewProvider
  discovery, Verifier JWKS) AND oauth2-driven HTTP
  (Config.TokenSource refresh, Config.Exchange). No additional
  `context.WithValue(ctx, oauth2.HTTPClient, ...)` is required.
 - **Out-of-scope: RFC 1918.** Per the `IsReservedIP` policy
  documented at [`internal/validation/ssrf.go:15-32`](../../internal/validation/ssrf.go),
  RFC 1918 ranges are NOT treated as reserved by the SSRF guard.
  certctl is designed to manage certificates inside private
  networks; filtering 10/8 + 172.16/12 + 192.168/16 would break
  the primary use case. Operators on hosted IaaS who want
  RFC 1918 treated as reserved can opt in via the future
  `CERTCTL_BLOCK_RFC1918_OUTBOUND` toggle (see acquisition-audit
  Sprint 5 RED-005). The Sprint 1 SSRF parity fix above closes
  the loopback / link-local / cloud-metadata leg only.
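The dial-time guard described above can be illustrated with a stdlib-only Go sketch. It is not the `validation.SafeHTTPDialContext` implementation: `isReserved`, `checkDialAddr`, and `newGuardedClient` are hypothetical names, and the blocked set here (loopback and link-local only, RFC 1918 allowed) is a simplification of the policy in `internal/validation/ssrf.go`.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"syscall"
	"time"
)

var errReserved = errors.New("dial blocked: reserved address")

// isReserved is a simplified stand-in for the real policy: loopback and
// link-local (which covers 169.254.169.254) are blocked; RFC 1918 ranges
// are deliberately allowed.
func isReserved(ip net.IP) bool {
	return ip.IsLoopback() || ip.IsLinkLocalUnicast() || ip.IsLinkLocalMulticast()
}

// checkDialAddr inspects the already-resolved "ip:port" the dialer is about
// to connect to, so a DNS answer that changes between validation and dial
// time is still caught.
func checkDialAddr(address string) error {
	host, _, err := net.SplitHostPort(address)
	if err != nil {
		return err
	}
	ip := net.ParseIP(host)
	if ip == nil || isReserved(ip) {
		return errReserved
	}
	return nil
}

// newGuardedClient returns an http.Client whose every connection attempt
// passes through checkDialAddr via net.Dialer.Control.
func newGuardedClient() *http.Client {
	d := &net.Dialer{
		Timeout: 5 * time.Second,
		Control: func(_, address string, _ syscall.RawConn) error {
			return checkDialAddr(address)
		},
	}
	return &http.Client{Transport: &http.Transport{DialContext: d.DialContext}}
}

func main() {
	_ = newGuardedClient()
	for _, a := range []string{"127.0.0.1:443", "169.254.169.254:80", "10.0.0.5:443"} {
		fmt.Println(a, checkDialAddr(a))
	}
}
```

A client built this way would then be attached to the request context (the text above describes `SafeOIDCContext` doing this through the `oauth2.HTTPClient` key), so library code that derives its HTTP client from the context inherits the guard.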
 ### OIDC first-admin bootstrap
 - **Coexists with the env-var-token bootstrap path.** Both can be
@@ -94,6 +94,46 @@ helm upgrade certctl deploy/helm/certctl/ \
 Postgres state survives the upgrade (the PVC is retained). The server / agent images bump per the chart's `image.tag`. See [`docs/archive/upgrades/`](../archive/upgrades/) for version-specific upgrade guidance.
 ### 2026-05-16 — ServiceMonitor TLS default flipped (DEPL-004)
 Acquisition-audit DEPL-004 closure. Pre-2026-05-16, `monitoring.serviceMonitor.tlsConfig` was empty by default and the chart template fell through to an implicit `insecureSkipVerify: true`. Post-2026-05-16, the values.yaml default is a real TLS verify against the chart's CA (caFile + serverName matching the existingSecret mount path the chart's Prometheus integration produces).
 The new default works out of the box for the canonical install (the chart's `existingSecret` or cert-manager-emitted Secret mounted at `/etc/prometheus/secrets/certctl-ca/`):
 ```yaml
 # Default in values.yaml (no operator action required for the
 # canonical install path).
 monitoring:
  serviceMonitor:
    enabled: true
    tlsConfig:
      caFile: /etc/prometheus/secrets/certctl-ca/ca.crt
      serverName: certctl-server
 ```
 Operators whose Prometheus pod mounts the CA bundle at a different path override `caFile`:
 ```yaml
 monitoring:
  serviceMonitor:
    enabled: true
    tlsConfig:
      caFile: /path/to/your/ca.crt
      serverName: your-cert-CN
 ```
 Operators who genuinely need `insecureSkipVerify` (demo / dev clusters) must opt in **explicitly** — blanking the `tlsConfig` block trips the chart's `{{ fail }}` guard at render time:
 ```yaml
 monitoring:
  serviceMonitor:
    enabled: true
    tlsConfig:
      insecureSkipVerify: true
 ```
 There is no way to inherit the pre-2026-05-16 implicit-skipVerify behavior silently. Operators with `monitoring.serviceMonitor.enabled: false` (the chart default) need no action — the template short-circuits before the `tlsConfig` block.
 ## Configuration reference
 Every value is documented at `deploy/helm/certctl/values.yaml`. Common tweaks:
@@ -74,22 +74,55 @@ metric surface meet our SLO needs today" — not "is the right library
 under the hood." If the answer to the first question is yes, the
 second is a refactor, not a feature gap.
-## Tracing — explicitly not yet shipped
+## Tracing — OTLP surface available, instrumentation pending
-certctl does **not** ship distributed tracing instrumentation today:
+Sprint 6 ACQ DEPL-006 closure (2026-05-16) stood up the OTel tracer-
 provider surface. Operators with an OTel collector can opt in via:
- No OpenTelemetry SDK setup in `cmd/server/main.go`.
+```
- No OTLP exporter wired into outbound calls (issuer connectors,
+CERTCTL_OTEL_ENABLED=true
-  agent enrollment, etc.).
+OTEL_EXPORTER_OTLP_ENDPOINT=https://otel-collector.example.com:4318
- The `go.opentelemetry.io/otel` packages that appear in
+```
  [`go.mod`](../../go.mod) are indirect-only — they're transitive
  dependencies of `coreos/go-oidc` and similar.
-This is honest: there is no in-process tracing surface to monitor,
+When `CERTCTL_OTEL_ENABLED` is true, `cmd/server/main.go` calls
-correlate, or sample. If your environment requires end-to-end traces
+`internal/observability.Init` which:
-across the certctl control plane + agents + issuer backends, this is
+
-a gap you would close on the certctl side as part of a v3 work item.
+- Constructs an OTLP/HTTP exporter (chosen over OTLP/gRPC to keep
-Until then:
+  the dependency surface narrow — see `internal/observability/otel.go`
  header for the transport-choice rationale).
 - Registers a real `sdktrace.TracerProvider` as the otel global.
 - Honors the standard OTel env vars (`OTEL_EXPORTER_OTLP_ENDPOINT`,
  `OTEL_EXPORTER_OTLP_HEADERS`, `OTEL_EXPORTER_OTLP_INSECURE`,
  `OTEL_SERVICE_NAME` overrides the default `certctl-server`, etc.).
 - Defers a graceful shutdown that flushes the in-flight batcher.
 What this **does not** ship yet:
 - No per-handler / per-DB / per-connector span instrumentation in
  the certctl code base. The OTel SDK emits the spans it generates
  internally (process resource attributes, eventual stdlib HTTP
  spans), but certctl-domain spans (issuance, renewal, deployment,
  agent enrollment) are a v2.3 roadmap follow-up.
 - No tracing-correlated metric exemplars in the Prometheus
  histograms above. Those still ship the per-issuer latency signal
  without per-request fan-out.
 - No backwards-compat shim — operators who never set
  `CERTCTL_OTEL_ENABLED` (the default) see zero behavior change.
  The init returns a no-op shutdown so the deferred call is safe
  to invoke unconditionally.
 When this matters today:
 - Operators wiring up a v3 instrumentation effort have the OTel
  surface in place; they only need to add `tracer.Start(ctx, "…")`
  call sites in the handler/service code.
 - Operators evaluating certctl for acquisition / due-diligence see
  an opt-in OTel surface in the current release rather than a "v3
  roadmap item" — a useful signal for buyer credibility per the
  acquisition-thesis framing in `WORKSPACE-ROADMAP.md` §3.
 Existing correlation surfaces stay in place until span coverage
 ships:
 - Structured logs include a `request_id` you can correlate across
  the server log stream. See
@@ -99,8 +132,9 @@ Until then:
  same per-issuer latency signal a trace span would, just without
  the per-request fan-out.
-OpenTelemetry instrumentation is tracked in
+Per-handler / per-query / per-connector span instrumentation is
-[WORKSPACE-ROADMAP.md](../../WORKSPACE-ROADMAP.md) as a v3 item.
+tracked in [WORKSPACE-ROADMAP.md](../../WORKSPACE-ROADMAP.md) under
 §2 (NHI / Agent Identity, Phase 4 in the path-b build plan).
 ## Logging
@@ -121,52 +155,142 @@ explicitly scrubs the password before it reaches the audit subsystem
 (see [`docs/operator/auth-threat-model.md`](auth-threat-model.md) §
 "Break-glass token leak").
-## Rate-limit behavior under restarts and replicas
+## Rate-limit behavior — configurable backend (memory or postgres)
-Where rate limits exist, they are **per-process, in-memory,
+The sliding-window-log rate limiters used across certctl's
-reset-on-restart, and not shared across replicas**. This matters for
+authenticated-but-shared-credential code paths (break-glass login,
-multi-replica deployments and for any compliance posture that asks
+OCSP per-IP, cert-export per-actor, EST per-principal, EST
-"what limits apply globally vs per-pod."
+failed-basic source-IP) carry a **configurable backend**. The
 operator picks between two implementations via
 `CERTCTL_RATE_LIMIT_BACKEND`:
 | Value      | When to use                                          |
 |------------|------------------------------------------------------|
 | `memory`   | Default. Single-replica deploys; sketchpad / dev.    |
 | `postgres` | HA deploys (`server.replicas > 1`). Cross-replica-consistent. |
 Phase 13 Sprint 13.2/13.3 (architecture diligence audit ARCH-M1
 closure) replaced the prior single-process limitation with a
 substantive close: when the operator opts into `postgres`, all
 replicas share the same
 `rate_limit_buckets` table (migration 000046) and per-key access is
 arbitrated via `SELECT FOR UPDATE` row locks. A 3-replica cluster
 hitting one rate-limited endpoint concurrently sees exactly the
 configured cap succeed across the cluster — not 3× the cap as the
 old per-process backend would have allowed.
 ### Operator decision tree
 ```
 Single replica (server.replicas = 1, the helm chart default)?
  └─ Use CERTCTL_RATE_LIMIT_BACKEND=memory (the default; no action
     required). Bucket lookups stay in-process; zero DB round-trips
     on the hot path.
 Two or more replicas?
  └─ Use CERTCTL_RATE_LIMIT_BACKEND=postgres. Two extra DB round-trips
     per Allow call (BEGIN ... SELECT FOR UPDATE ... UPDATE ... COMMIT);
     acceptable on the gated hot path. The Sprint 13.2 multi-replica
     integration test pins exactly-cap enforcement across N replicas
     as the closure proof.
 ```
 ### Inventory
-| Limiter                                              | Scope                | Window | Cap                            | Survives restart? | Shared across replicas? |
+| Limiter                                              | Scope                | Window | Cap                            |
-|---|---|---|---|---|---|
+|---|---|---|---|
-| Break-glass login (per source-IP)                    | `internal/api/handler/auth_breakglass.go` | 60s   | 5 attempts                     | No                | No                      |
+| Break-glass login (per source-IP)                    | `internal/api/handler/auth_breakglass.go` | 60s   | 5 attempts                     |
-| SCEP/Intune per-device challenge                     | `internal/scep/intune/`                   | 60s   | configurable (`*_PER_MINUTE`)  | No                | No                      |
+| OCSP query (per source-IP)                           | `internal/api/handler/certificates.go`    | 60s   | configurable (`CERTCTL_OCSP_RATE_LIMIT_PER_IP_MIN`) |
-| EST per-principal CSR enrollment                     | `internal/est/`                           | 60s   | configurable                   | No                | No                      |
+| Cert export (per actor)                              | `internal/api/handler/export.go`          | 1h    | configurable (`CERTCTL_CERT_EXPORT_RATE_LIMIT_PER_ACTOR_HR`) |
-| EST HTTP-Basic source-IP failed-auth                 | `internal/est/`                           | 60s   | configurable                   | No                | No                      |
+| EST per-principal CSR enrollment                     | `internal/api/handler/est.go`             | 24h   | configurable (per-profile `RateLimitPerPrincipal24h`) |
-| ACME per-account orders / key-change / challenge-respond | `internal/service/acme.go`            | 1h    | configurable                   | No                | No                      |
+| EST HTTP-Basic source-IP failed-auth                 | `internal/api/handler/est.go`             | 60m   | 10 attempts                    |
 | SCEP/Intune per-device challenge                     | `internal/scep/intune/`                   | 60s   | configurable (`*_PER_MINUTE`)  |
 | ACME per-account orders / key-change / challenge-respond | `internal/service/acme.go`            | 1h    | configurable                   |
-All five use the shared `internal/ratelimit/sliding_window.go`
+The `CERTCTL_RATE_LIMIT_BACKEND` selector applies to the first five
-primitive. Buckets live in a single per-process map guarded by a
+(the cmd/server-wired limiters). The SCEP/Intune wrapper + the ACME
-mutex; the package-level cap prevents unbounded growth under
+per-account limiter ride their own internal accounting today; both
-adversarial key cardinality (default 100,000 keys; oldest-by-newest-
+are tracked as follow-ups in WORKSPACE-ROADMAP.md.
 timestamp evicted under pressure).
-### Implications for multi-replica deployments
+### Backend internals
- **Effective per-replica cap is the documented cap.** A 2-replica
+Both backends share the algorithm: sliding-window log + per-key
-  deployment lets through up to 2× the per-key window cap before
+bucket + prune-on-Allow.
  either replica rejects.
 - **Restart resets the bucket.** A `kubectl rollout restart` empties
  the in-memory windows on every replica. An attacker who notices
  this could in principle re-issue burst attempts after every roll;
  the threat model accepts this because rollouts are operator-driven
  and the relevant endpoints already require credentials.
 - **No cross-replica fan-out.** Rate-limit decisions on replica A
  are not visible to replica B. Sticky-session ingress routing (with
  `service.spec.sessionAffinity: ClientIP` on Kubernetes or the
  equivalent on your load balancer) tightens the effective cap to
  per-source-IP on one pinned replica, rather than per-source-IP on
  whichever pod the request happened to land on.
-If your threat model requires globally-enforced rate limits across
+**Memory backend (`memory`)** — per-process map keyed by bucket key;
-replicas, the implementation surface is roughly: swap the per-process
+mutex-guarded; package-level LRU cap prevents unbounded growth under
-map for a database-backed sliding window (or a Redis-backed equivalent
+adversarial key cardinality (default 100,000 keys per limiter
-if you already run Redis). This is on the
+instance; oldest-by-newest-timestamp evicted under pressure).
-[WORKSPACE-ROADMAP.md](../../WORKSPACE-ROADMAP.md) as a v3 item;
+Implemented at `internal/ratelimit/sliding_window.go`.
-nothing in the certctl threat model today requires it.
+
 **Postgres backend (`postgres`)** — same algorithm against the
 `rate_limit_buckets` table:
 ```sql
 CREATE TABLE rate_limit_buckets (
    bucket_key TEXT          PRIMARY KEY,
    timestamps TIMESTAMPTZ[] NOT NULL DEFAULT '{}',
    updated_at TIMESTAMPTZ   NOT NULL DEFAULT NOW()
 );
 ```
 `Allow(key, now)` opens a transaction, ensures the row exists
 (`INSERT ... ON CONFLICT DO NOTHING`), acquires the row lock
 (`SELECT ... FOR UPDATE`), prunes timestamps older than `now-window`,
 compares the post-prune count against `maxN`, conditionally appends
 `now`, persists, and commits. The row lock is what arbitrates across
 replicas: replicas A and B firing simultaneous `Allow("k")` never
 race because Postgres serializes the per-key row update across the
 cluster. Implemented at
 `internal/ratelimit/postgres_sliding_window.go`.
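The shared algorithm (prune, compare, conditionally append) can be sketched as an in-memory limiter. This is an illustration under assumptions, not the code in `internal/ratelimit/`: the type and method names are hypothetical, and the key-cardinality cap and eviction described for the memory backend are omitted.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// SlidingWindow is a sliding-window-log limiter: it stores one timestamp per
// admitted request and admits a new one only while fewer than maxN
// timestamps fall inside the trailing window.
type SlidingWindow struct {
	mu      sync.Mutex
	window  time.Duration
	maxN    int
	buckets map[string][]time.Time
}

func NewSlidingWindow(maxN int, window time.Duration) *SlidingWindow {
	return &SlidingWindow{window: window, maxN: maxN, buckets: map[string][]time.Time{}}
}

// Allow prunes timestamps older than now-window, then admits the request
// (recording now) only if the post-prune count is below maxN.
func (s *SlidingWindow) Allow(key string, now time.Time) bool {
	s.mu.Lock()
	defer s.mu.Unlock()

	cutoff := now.Add(-s.window)
	kept := s.buckets[key][:0] // filter in place, reusing the backing array
	for _, ts := range s.buckets[key] {
		if ts.After(cutoff) {
			kept = append(kept, ts)
		}
	}
	if len(kept) >= s.maxN {
		s.buckets[key] = kept
		return false
	}
	s.buckets[key] = append(kept, now)
	return true
}

// allowAtSecond is a demo wrapper that takes seconds since the Unix epoch.
func (s *SlidingWindow) allowAtSecond(key string, sec int64) bool {
	return s.Allow(key, time.Unix(sec, 0))
}

// demoLimiter returns a limiter that admits 2 requests per minute per key.
func demoLimiter() *SlidingWindow { return NewSlidingWindow(2, time.Minute) }

func main() {
	l := demoLimiter()
	for _, sec := range []int64{0, 1, 2, 90} {
		fmt.Println(sec, l.allowAtSecond("ip:203.0.113.7", sec))
	}
}
```

In the sketch the mutex plays the role the text assigns to the `SELECT ... FOR UPDATE` row lock in the postgres backend: it serializes the read-prune-append sequence for a key so two concurrent callers cannot both observe the last free slot.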
 ### Janitor sweep (postgres backend only)
 The scheduler runs a `rate_limit_buckets` janitor every
 `CERTCTL_RATE_LIMIT_JANITOR_INTERVAL` (default 5m, minimum 1m). The
 sweep deletes rows whose `updated_at` is older than the longest
 configured window any limiter uses (24h today, matching the EST
 per-principal limiter). Idempotent; repeated sweeps find zero rows.
 The memory backend's prune-on-Allow path keeps buckets short-lived
 without a separate sweep, so the loop is a no-op when
 `backend=memory`.
 ### Falsifiable closure proof
 The Phase 13 Sprint 13.2 integration test
 `internal/integration/ratelimit_multi_replica_test.go`
 (`//go:build integration`) fires 100 concurrent `Allow("test-key")`
 calls round-robined across 3 independent `PostgresSlidingWindowLimiter`
 instances sharing one Postgres database (`cap=10`, `window=1m`) and
 asserts exactly 10 succeed + 90 return `ErrRateLimited`. If the
 cross-replica row lock weren't arbitrating, each replica would
 independently let through ~3-4 requests, giving 12-15 successes
 total. Re-run:
 ```
 go test -tags=integration -count=1 -run TestRateLimit_MultiReplica \
    ./internal/integration/...
 ```
 ### Helm chart wiring
 The helm chart at `deploy/helm/certctl/` exposes the backend via
 `server.rateLimiting.backend` (default `memory`). To opt into the
 postgres backend for an HA deploy:
 ```
 helm upgrade --install certctl deploy/helm/certctl \
    --set server.replicas=3 \
    --set server.rateLimiting.backend=postgres \
    --set server.rateLimiting.janitorInterval=5m
 ```
 `server.replicas > 1` without flipping `backend` to `postgres` works
 fine (the limits stay per-process), but the operator gets a 2× /
 3× / N× effective cap depending on replica count. The chart does NOT
 auto-flip on `replicas > 1` because some HA deploys deliberately want
 per-process limits (sticky-session ingress + tight per-replica caps
 to detect bot traffic at the edge before it hits the application).
 ### Where these numbers live
@@ -0,0 +1,136 @@
 # Privacy & retention (federated-user PII)
 > Last reviewed: 2026-05-16
 Sprint 6 COMP-002-RETENTION closure. certctl stores three categories
 of personally-identifiable information for federated humans (Auth
 Bundle 2 OIDC users):
 | Column | Source | Used by |
 |---|---|---|
 | `users.email` | IdP claim (`email`) | Operator GUI "find user by email", display in lists, audit attribution. |
 | `users.display_name` | IdP claim (`name`) | UI display string for the human. |
 | `users.oidc_subject` | IdP claim (`sub`) | Stable identifier — joined with `oidc_provider_id` in the (provider, subject) UNIQUE constraint. |
 Pre-fix, deactivating a user (admin-side `auth.user.deactivate`)
 soft-deleted the row by setting `deactivated_at` but left the PII
 columns populated indefinitely. The Sprint 6 fix adds an automatic
 purge pipeline.
 ## Retention pipeline shape
 ```
 Day 0   admin → POST /api/v1/auth/users/u-X/deactivate
                ├─ users.deactivated_at = NOW()
                └─ all active sessions for u-X revoked
 Day N   scheduler's userRetentionLoop tick (default cadence 24h)
        └─ UserRetentionService.PurgeDeactivatedUsers
           ├─ SELECT users WHERE deactivated_at < NOW() - retention_window
           ├─ For each row (batch-capped per tick):
           │     UserRetentionService.DeleteUserPII(u.id)
           │     ├─ revoke all active sessions (defense-in-depth)
           │     ├─ email        := "purged@redacted.local"
           │     ├─ display_name := "[purged]"
           │     ├─ oidc_subject := "sha256:" || hex(sha256(original))
           │     └─ audit_events row (action=user.purge_pii, category=auth)
 ```
 `users.id` is **preserved**. Historical `audit_events.actor = u-X`
 rows still resolve to the row (now scrubbed). This is the
 forensic-attribution guarantee — the operator can prove "user u-X
 performed action Y on date Z" even after the PII is gone.
 `oidc_subject` is **hashed**, not nullified, for two reasons:
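 1. The `(oidc_provider_id, oidc_subject)` UNIQUE constraint would
   trip if multiple purged users converged on one shared placeholder
   value. (Plain NULLs collide only where the constraint is declared
   `NULLS NOT DISTINCT`; PostgreSQL's default treats them as distinct.)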
 2. Re-login under the same IdP subject creates a fresh row (different
   `u-` id) because `GetByOIDCSubject` won't match the hashed token —
   the original subject is unrecoverable from the hash. This is the
   "right-to-be-forgotten" behavior: the same human logging back in
   is functionally a new account.
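The eligibility check and the scrub step from the pipeline diagram can be sketched in Go. The `User` type and both function names are hypothetical, and the exact hash input and hex casing used by `UserRetentionService.DeleteUserPII` are assumptions here; only the placeholder strings and the `sha256:` prefix come from the text above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// User holds only the columns the retention pipeline touches.
type User struct {
	ID, Email, DisplayName, OIDCSubject string
	DeactivatedAt                       *time.Time // nil while the user is active
}

// dueForPurge reports whether the retention window has elapsed since
// deactivation (deactivated_at < now - window).
func dueForPurge(u User, now time.Time, window time.Duration) bool {
	return u.DeactivatedAt != nil && u.DeactivatedAt.Before(now.Add(-window))
}

// scrubPII returns a copy with the PII columns overwritten. The ID is left
// intact so historical audit rows still resolve to the user.
func scrubPII(u User) User {
	sum := sha256.Sum256([]byte(u.OIDCSubject))
	u.Email = "purged@redacted.local"
	u.DisplayName = "[purged]"
	u.OIDCSubject = "sha256:" + hex.EncodeToString(sum[:])
	return u
}

func main() {
	now := time.Now()
	deactivated := now.Add(-60 * 24 * time.Hour)
	u := User{ID: "u-test", Email: "a@example.com", DisplayName: "Alice",
		OIDCSubject: "abc", DeactivatedAt: &deactivated}
	if dueForPurge(u, now, 30*24*time.Hour) {
		u = scrubPII(u)
	}
	fmt.Println(u.ID, u.Email, u.DisplayName, u.OIDCSubject)
}
```

Because the stored value is a one-way digest with a `sha256:` prefix, a lookup by the raw IdP subject no longer matches the scrubbed row, which is what makes a returning user land on a fresh account.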
 ## Operator configuration
 | Env var | Default | Notes |
 |---|---|---|
 | `CERTCTL_USER_RETENTION_INTERVAL` | `24h` | Tick cadence for the scheduler's userRetentionLoop. Zero or negative ignored. |
 | `CERTCTL_USER_RETENTION_WINDOW` | `30 * 24h` (30 days) | How long after `deactivated_at` a row's PII stays in the table. Operators with stricter GDPR/CCPA expectations may shorten. |
 | `CERTCTL_USER_RETENTION_BATCH_CAP` | `200` | Per-tick row budget. Larger backlogs spread across multiple ticks. 0 = unbounded (test fixtures only). |
 ## How to verify retention is working
 1. Deactivate a test user via the admin path:
   ```bash
   curl -X POST -H "X-API-Key: $ADMIN_KEY" \
     https://certctl.example.com/api/v1/auth/users/u-test/deactivate
   ```
 2. Confirm the row's `deactivated_at` is set:
   ```sql
   SELECT id, email, deactivated_at FROM users WHERE id = 'u-test';
   ```
 3. Backdate `deactivated_at` to past the retention window (only for
   testing — never in production):
   ```sql
   UPDATE users SET deactivated_at = NOW() - INTERVAL '60 days'
   WHERE id = 'u-test';
   ```
   (Note: this UPDATE will succeed because `users` doesn't have a
   WORM trigger; the audit-events WORM trigger is unrelated.)
 4. Wait for the next `userRetentionLoop` tick (or restart the server
   to force an immediate sweep). Confirm scrub:
   ```sql
   SELECT id, email, display_name, oidc_subject
     FROM users
    WHERE id = 'u-test';
   ```
   Expected: `email = 'purged@redacted.local'`,
   `display_name = '[purged]'`,
   `oidc_subject LIKE 'sha256:%'`.
 5. Confirm an audit row was emitted:
   ```sql
   SELECT id, actor, action, resource_id, timestamp
     FROM audit_events
    WHERE action = 'user.purge_pii' AND resource_id = 'u-test'
    ORDER BY timestamp DESC LIMIT 1;
   ```
 ## What's NOT covered (deferred work)
 The Sprint 6 fix is Phase 1 of the audit's COMP-002-RETENTION
 recommendation. Three further pieces are forward-looking:
 - **GDPR data-subject access request (DSAR) export.** A "show me
  everything you know about me" endpoint is not yet implemented.
  Operators on EU-resident data should treat this as a manual SQL
  procedure today; track for Phase 2.
 - **Cascade purge of related rows.** Sessions are revoked (above);
  api_keys with `created_by = u-X` are NOT yet purged on scrub. The
  api_keys table doesn't have a foreign key to users (it indexes by
  `actor_id` strings, free-form), so the cascade is a service-layer
  concern that needs explicit wiring. Track for Phase 2.
 - **Per-event PII redaction in `audit_events.details`.** The existing
  `RedactDetailsForAudit` (`internal/service/audit_redact.go`) scrubs
  credential + PII keys at write time. A future feature for
  "retroactively re-redact existing rows" would interact with the WORM
  trigger; out of scope today.
 ## See also
 - `internal/service/user_retention.go` — `UserRetentionService` source.
 - `internal/scheduler/scheduler.go::userRetentionLoop` — scheduler loop.
 - `migrations/000036_users.up.sql` — `users` table definition.
 - `migrations/000045_users_deactivated_at.up.sql` — `deactivated_at` column.
 - `docs/operator/audit-chain.md` — paired Sprint 6 tamper-evidence work.
@@ -68,6 +68,45 @@ giving them the keys to the kingdom. The
 `internal/domain/auth/auditor_test.go` invariants pin this set going
 forward.
 ### Auditor role invariants (DOC-002 / COMP-005 closure)
 Acquisition-audit DOC-002 + COMP-005 closure (Sprint 7 ACQ, 2026-05-16).
 The auditor role's permission set is **pinned at exactly two
 permissions** — `audit.read` and `audit.export` — and any drift breaks
 the SOC 2 / FedRAMP / PCI separation. The pin is enforced at three
 layers and the load-bearing layer is the unit-test set, not a bash CI
 guard:
 1. **Schema layer** — `migrations/000029_rbac.up.sql:261-262` seeds
   exactly two `role_permissions` rows for `r-auditor`
   (`r-auditor / p-audit-read / global / NULL` and
   `r-auditor / p-audit-export / global / NULL`).
   `migrations/000039_audit_crit1_perms.up.sql:111` adds an inline
   comment confirming `r-auditor` was NOT widened by the migration that
   shipped the five admin-only fine-grained perms.
 2. **Code layer** — `internal/domain/auth/DefaultRoles[RoleIDAuditor]`
   matches the schema. A future code change that adds a non-audit
   permission to the slice is caught by:
 3. **Test layer** (the load-bearing one) —
   `internal/domain/auth/auditor_test.go` ships three pinning tests:
   - `TestAuditorRoleHoldsExactlyAuditReadAndExport` — set-equality
     comparison; fails on any add or remove
   - `TestAuditorRoleDoesNotHoldMutatingOrReadingNonAuditPerms` —
     enumerates the slice and rejects any permission outside the
     `{audit.read, audit.export}` set; catches subtle widening even if
     the set-equality test is bypassed
   - `TestAuditorRoleSeparateFromViewer` — pins that the auditor and
     viewer permission sets are disjoint except for `audit.read` (which
     viewer shares by design); catches the "auditor inherits viewer
     reads" leg
 A bash CI guard was deliberately **not** added — the property is
 already enforced at the Go test layer with stronger semantics
 (struct-aware set equality) than `grep` could provide. If a future
 contributor proposes widening `r-auditor`, the three tests above
 fail at `go test ./internal/domain/auth/...` BEFORE the change can
 land in a merge.
 The five **admin-only fine-grained perms** seeded by migration
 000030 gate the high-blast-radius endpoints:
@@ -0,0 +1,105 @@
 # Runbook: regenerating Playwright visual-regression snapshots
 > Last reviewed: 2026-05-16
 Use this when:
 - You've intentionally changed UI shape (added a column, restyled a
  banner, replaced an icon set) and the next `Frontend E2E` CI run
  fails with `Screenshot comparison failed:` errors on multiple
  `04-visual-regression.spec.ts` cases.
 - A deterministic-but-platform-specific font-rendering difference
  emerges (Linux runner vs your Mac dev box) and you want to refresh
  baselines from the canonical CI environment.
 TEST-003 closure (Sprint 5, 2026-05-16) flipped the workflow from
 `continue-on-error: true` to `false`. Pre-fix you could ignore a
 red E2E run and ship anyway. Post-fix the run blocks the merge, so
 any change that legitimately moves pixels needs the snapshot bump
 captured here.
 Do NOT use this to make a real visual regression disappear. The
 snapshots are version-controlled evidence — if a pixel diff fires
 unexpectedly, investigate the rendering change before bumping.
 ## What "snapshots" means here
 `web/playwright/04-visual-regression.spec.ts` calls
 `toHaveScreenshot()`. Playwright stores the canonical PNG at
 `web/playwright/04-visual-regression.spec.ts-snapshots/<test-name>-<browser>-<platform>.png`
 on first run. Subsequent runs compare pixel-by-pixel against that
 file. We commit the PNGs to git so the CI runner and local dev
 share a single source of truth.
 Two failure modes the diff is designed to catch:
 - **Intentional UI change.** You added a new field to the Targets
  table. The screenshot now has an extra column. The baseline
  doesn't. Pixel diff fires — this is the "operator updates
  baselines" path documented below.
 - **Regression.** A CSS change inadvertently shifted spacing.
  Investigate before regenerating; don't paper over the diff.
 ## Standard bump (one or two affected tests)
 1. Run the E2E suite locally with the update flag against the
   same Linux runner image Playwright uses:
   ```bash
   cd web
   npx playwright test 04-visual-regression.spec.ts --update-snapshots
   ```
   If you're on macOS, run it through Docker against the same image
   the workflow uses (`mcr.microsoft.com/playwright`); font
   rendering differs between platforms and Linux baselines must
   come from a Linux source.
 2. Inspect every regenerated PNG:
   ```bash
   git status web/playwright/*.spec.ts-snapshots/
   git diff --stat web/playwright/*.spec.ts-snapshots/
   ```
   PNG diffs in `git diff` are unhelpful — open the files in any
   image viewer and confirm the change matches your intent.
 3. Commit the snapshots alongside the source change in the same
   PR:
   ```bash
   git add web/playwright/*.spec.ts-snapshots/
   git commit -m "chore(e2e): refresh visual snapshots after <change>"
   ```
 4. Push and confirm CI's E2E job greens out.
 ## Mass bump (font upgrade, framework migration)
 Use the workflow's `workflow_dispatch` input to regenerate from
 CI's canonical environment:
 1. Go to `Actions` → `Frontend E2E` → `Run workflow`.
 2. Set `update_snapshots: true`.
 3. The workflow runs Playwright with `--update-snapshots`, then
   commits + pushes the regenerated PNGs to a feature branch
   `playwright/snapshot-update-<run-id>`.
 4. Open a PR from that branch to master. Review the PNG diffs in
   the PR view (GitHub renders image diffs side-by-side for
   committed PNGs).
 5. Merge.
 ## What NOT to do
 - Don't regenerate snapshots natively on a developer's macOS or
  Windows machine and push them as the canonical baseline. The Linux
  runner's font hinting differs from macOS / Windows, so the
  baselines must come from the same image the CI workflow runs.
 - Don't add `--update-snapshots` to the always-run e2e step in
  `.github/workflows/e2e.yml`. That's how snapshot regressions
  become invisible — every diff gets accepted, every PR ships
  fine, and the visual-regression layer becomes decorative.
 - Don't bump snapshots in a "fix typo" PR. Every PNG change is
  an architectural decision; pair it with the source change that
  justifies it.
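 One way to catch the first pitfall before review is to list any
 committed baselines whose platform suffix is not `linux`. A minimal
 sketch, assuming Playwright's default `<test-name>-<browser>-<platform>.png`
 naming with `darwin` and `win32` as the non-Linux platform values;
 the helper name `non_linux_baselines` is hypothetical:
 ```bash
 # Hypothetical helper: print snapshot PNGs generated on macOS or Windows.
 non_linux_baselines() {
   local root="${1:-web/playwright}"
   [ -d "$root" ] || return 0   # nothing to check outside the repo
   find "$root" -type f \( -name '*-darwin.png' -o -name '*-win32.png' \) | sort
 }
 # Run from the repository root; empty output means only Linux baselines exist.
 non_linux_baselines web/playwright
 ```
 Any path it prints is a baseline the Linux CI runner will never
 compare against, so it is safe to treat it as a stray file.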
@@ -1,6 +1,6 @@
 # Runbook: PostgreSQL backup for certctl
-> Last reviewed: 2026-05-13
+> Last reviewed: 2026-05-16
 Use this when:
 - You're setting up a new certctl deployment and need a backup policy
@@ -109,38 +109,76 @@ is the authoritative reference.
 ## Automation paths
-This is the gap an acquisition reviewer typically wants to see filled.
-certctl ships no backup CronJob template in the Helm chart — the
-operator owns this layer because:
-1. The right tool depends on the deployment topology (in-cluster
-   Postgres vs. managed Postgres vs. self-hosted on a VM).
-2. The right secret-management integration depends on the operator's
-   existing stack (Vault, AWS Secrets Manager, GCP Secret Manager,
-   sealed-secrets, External Secrets).
-3. The right storage backend depends on the operator's existing
-   off-host blob storage.
-A bundled CronJob would be a half-answer for any operator with an
-established backup posture, and would have to be torn out before
-production. Three sample recipes that cover the common cases:
-- **In-cluster Postgres → S3:** a CronJob running an alpine image with
-  `aws-cli` + the `pg_dump` command above, output piped to
-  `aws s3 cp`. Cosign-signed if your supply-chain policy requires it.
-- **Managed Postgres (AWS RDS / GCP Cloud SQL / Azure DB):** rely on
-  the cloud provider's built-in PITR backup; configure retention
-  ≥ 30 days; the certctl deployment surface is the connection string
-  alone.
-- **Self-hosted VM:** systemd timer + `pg_dump` + `restic` (or
-  `borgbackup`) to encrypted off-host storage.
-Tracked in [WORKSPACE-ROADMAP.md](../../../WORKSPACE-ROADMAP.md) as a
-post-v2.1.0 nice-to-have: an opt-in Helm CronJob template for the
-in-cluster-Postgres-to-S3 case as a starter. The right time to ship
-it is when a real operator asks for it; speculatively shipping it
-without that signal would just produce a template every deployment
-ends up rewriting.
+certctl ships an **opt-in Helm CronJob** for the in-cluster-Postgres
+case (the most common bundled-deploy shape). The template lives at
+`deploy/helm/certctl/templates/backup-cronjob.yaml` and is gated by
+`backup.enabled` in `values.yaml`. Default OFF; flip it on with one
+toggle and a sink choice. For managed Postgres (AWS RDS / GCP Cloud
+SQL / Azure DB) the operator relies on the provider's PITR layer;
+this CronJob is intentionally scoped to the in-cluster-Postgres path.
+### Enabling the bundled CronJob
+```bash
+# PVC sink (in-cluster persistent volume — simplest)
+helm upgrade --install certctl charts/certctl \
+  --set backup.enabled=true \
+  --set backup.sink=pvc \
+  --set backup.pvc.storageClassName=<your-storage-class> \
+  --set backup.pvc.size=20Gi \
+  --set backup.schedule="0 2 * * *"
+# S3 sink (off-cluster, recommended for any deploy past the lab)
+kubectl create secret generic certctl-backup-aws \
+  --from-literal=AWS_ACCESS_KEY_ID=AKIA... \
+  --from-literal=AWS_SECRET_ACCESS_KEY=... \
+  --namespace certctl
+helm upgrade --install certctl charts/certctl \
+  --set backup.enabled=true \
+  --set backup.sink=s3 \
+  --set backup.s3.bucket=my-certctl-backups \
+  --set backup.s3.region=us-east-1 \
+  --set backup.s3.credentialsSecret=certctl-backup-aws \
+  --set backup.schedule="0 2 * * *"
+```
+The CronJob runs `pg_dump --format=custom --no-owner --no-acl
+--dbname=certctl` (the same shape as the manual command earlier in
+this runbook, so a manual dump and a Job dump are byte-comparable)
+and ships the artifact to the configured sink. Off-host retention
+is the sink's responsibility — S3 lifecycle rules or PVC snapshot
+retention on the storage class, not the CronJob.
 ### When the bundled CronJob is NOT the answer
 - **Managed Postgres (AWS RDS / GCP Cloud SQL / Azure DB).** Use the
  provider's built-in PITR; configure retention ≥ 30 days. The
  certctl deployment surface is the connection string alone — no
  CronJob to run.
 - **Self-hosted Postgres on a VM (no Kubernetes).** Use a systemd
  timer + `pg_dump` + `restic` (or `borgbackup`) to encrypted
  off-host storage. The bundled CronJob has no equivalent on bare
  VMs.
 - **Already running pgbackrest / wal-g.** Keep using it. The bundled
  CronJob is for the operator who doesn't yet have a backup posture,
  not a replacement for production-grade WAL-shipping.
 ### Recovery objectives
 The bundled CronJob targets the same RPO/RTO that any nightly-dump
 strategy gives you:
 - **RPO ≈ 24h** at the default `0 2 * * *` schedule (you lose at
  most one day of writes if Postgres burns down). Tighten by running
  every 6h or 1h; tighten further by switching to WAL-shipping
  (out of scope for the bundled CronJob).
 - **RTO ≈ 30–60min** for the restore drill below — drop the dump
  into a fresh Postgres instance, point certctl at it, confirm
  routes return 200. Empirically measured during the
  [disaster-recovery runbook](disaster-recovery.md) drill.
 If your contractual RPO is below 24h, run pgbackrest WAL-shipping
 alongside (or instead of) the CronJob.
 ## Verification — what to dry-run quarterly
@@ -160,6 +198,42 @@ to your quarterly on-call rotation:
 The [disaster-recovery runbook](disaster-recovery.md) covers what to
 do when this dry-run reveals a gap.
 ## CI restore verification
 > Acquisition-audit DEPL-005 + DATA-012 closure (Sprint 4 ACQ,
 > 2026-05-16). The quarterly dry-run above is the operator-side
 > proof; the workflow below is the upstream-side proof.
 The certctl repo ships a weekly GitHub Actions workflow that
 exercises the **exact** pg_dump shape this runbook recommends
 (`--format=custom --no-owner --no-acl`) against a real Postgres
 container, then asserts the audit_events hash chain round-trips
 byte-for-byte across the dump → restore boundary. A regression in
 the dump format, in a Postgres minor bump, or in migration 000047's
 canonical-payload serialization would surface in the next Monday
 run instead of on a customer's restore day.
 - **Workflow:** [`.github/workflows/backup-restore.yml`](../../../.github/workflows/backup-restore.yml)
  — Mondays 07:00 UTC + `workflow_dispatch`. Postgres service
  container pinned to the same SHA256 digest as
  `deploy/docker-compose.yml`.
 - **Harness:** [`deploy/test/backup-restore-smoke.sh`](../../../deploy/test/backup-restore-smoke.sh)
  — runs the workload → `pg_dump -Fc` → `DROP SCHEMA public CASCADE`
  → `pg_restore` → verify cycle. Locally runnable against any
  reachable Postgres (it DROPs the schema, so do not point it at
  data you care about).
 - **Workload + verifier:** [`deploy/test/backupsmoke/main.go`](../../../deploy/test/backupsmoke/main.go)
  — generates 24 synthetic `audit_events` rows representing an
  issue/renew/revoke/auth-login cycle, snapshots the chain head
  before the backup, and after restore runs
  `audit_events_verify_chain()` to confirm `first_break_id IS NULL`.
 The CI workflow is not a replacement for the quarterly operator
 dry-run — it does not exercise the operator-managed file material
 (CA keys, RA keys, trust anchors) listed in the "What to back up"
 table above. Treat it as the dump-shape regression test; the
 quarterly run remains the full-restore correctness test.
 ## Related reading
 - [`docs/operator/runbooks/disaster-recovery.md`](disaster-recovery.md) — the restore companion
@@ -0,0 +1,243 @@
 # Runbook: Prometheus bearer token for the metrics scrape endpoint
 > Last reviewed: 2026-05-14
 Use this when:
 - You're enabling Prometheus Operator scraping via the Helm chart's
  `monitoring.serviceMonitor.enabled` toggle.
 - Your Prometheus scrapes are returning 401 against
  `/api/v1/metrics/prometheus`.
 - An auditor asks "how is the metrics endpoint authenticated?"
 ## The constraint
 The certctl server exposes Prometheus metrics at
 `/api/v1/metrics/prometheus`. This endpoint is **RBAC-gated on the
 `metrics.read` permission** (per `internal/api/router/router.go`).
 Like every other gated handler, it requires an authenticated actor
 holding that permission — there is no anonymous-scrape path.
 The rationale: the metrics payload includes operational counters
 (cert counts by status, agent counts, issuance failure rates) that
 a public-facing observer should not see. Most certctl deployments
 expose a reverse proxy / load balancer to the wider network; the
 auth gate on `/api/v1/metrics/prometheus` prevents an external
 observer from learning operational state via the metrics endpoint
 even when the proxy itself is reachable.
 ## What you need to set up
 Three pieces:
 1. **An API key with `metrics.read` permission** (and only that
   permission — least-privilege).
 2. **A Kubernetes Secret** holding that API key.
 3. **`monitoring.serviceMonitor.bearerTokenSecret`** in the chart's
   values pointing at the Secret.
 ## Step 1: Create the metrics-read role + API key
 The chart's seed migration ships a `metrics-read` role-template, but
 some operators want a dedicated identity per scrape source. Both
 approaches work; the dedicated-identity path is below.
 ```bash
 # 1. Bootstrap or impersonate a session with auth.role.assign +
 #    auth.apikey.create permissions (admin actor is fine).
 # 2. Create a role with only metrics.read.
 curl -sS --cacert ./ca.crt -X POST \
  -H "Authorization: Bearer ${ADMIN_API_KEY}" \
  -H "Content-Type: application/json" \
  https://certctl.your-org.example/api/v1/auth/roles \
  -d '{"id":"r-prometheus-scrape","name":"Prometheus scrape","permissions":["metrics.read"]}'
 # 3. Create an actor that holds the role.
 curl -sS --cacert ./ca.crt -X POST \
  -H "Authorization: Bearer ${ADMIN_API_KEY}" \
  -H "Content-Type: application/json" \
  https://certctl.your-org.example/api/v1/auth/actors \
  -d '{"id":"actor-prometheus","name":"Prometheus scrape","roles":["r-prometheus-scrape"]}'
 # 4. Mint an API key for the actor. The response includes a
 #    `key_value` field that's only returned ONCE — capture it.
 curl -sS --cacert ./ca.crt -X POST \
  -H "Authorization: Bearer ${ADMIN_API_KEY}" \
  -H "Content-Type: application/json" \
  https://certctl.your-org.example/api/v1/auth/apikeys \
  -d '{"actor_id":"actor-prometheus","name":"prometheus-scrape-token"}' \
  | tee /tmp/prom-key.json
 # Extract just the secret material:
 jq -r '.key_value' /tmp/prom-key.json
 ```
 The mint endpoint returns the API key plaintext exactly once. The
 server stores only a hash of the key and compares it in constant
 time; if you lose the key value, mint a new one.
 ## Step 2: Create the Kubernetes Secret
 ```bash
 NAMESPACE=certctl
 API_KEY=$(jq -r '.key_value' /tmp/prom-key.json)
 kubectl create secret generic certctl-prometheus-key \
  -n "$NAMESPACE" \
  --from-literal=api-key="$API_KEY"
 ```
 Now scrub the temporary file:
 ```bash
 shred -u /tmp/prom-key.json
 ```
 ## Step 3: Wire the Secret into the chart values
 In your `values.yaml` (or `--set` overrides):
 ```yaml
 monitoring:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 30s
    scrapeTimeout: 10s
    bearerTokenSecret:
      name: certctl-prometheus-key
      key: api-key
 ```
 Re-apply the chart:
 ```bash
 helm upgrade certctl . -n "$NAMESPACE" --reuse-values -f values.yaml
 ```
 The rendered ServiceMonitor will now include the `bearerTokenSecret`
 block. Prometheus Operator's reconciler picks it up and injects the
 bearer token into the scrape request.
 ## Verification
 ```bash
 # 1. Confirm the ServiceMonitor renders with the secret reference
 kubectl get servicemonitor -n "$NAMESPACE" certctl-server -o yaml \
  | grep -A2 bearerTokenSecret
 # Expected:
 #       bearerTokenSecret:
 #         name: certctl-prometheus-key
 #         key: api-key
 # 2. Tail the certctl-server logs for the next ~60 seconds (two
 #    scrape intervals at the 30s setting above). Look for incoming
 #    GET /api/v1/metrics/prometheus requests that authenticate
 #    successfully, with no 401s.
 kubectl logs -n "$NAMESPACE" -l app.kubernetes.io/component=server \
  --tail=100 -f | grep -E "GET /api/v1/metrics/prometheus|metrics-scrape"
 # 3. From the Prometheus UI's "Targets" page, the certctl-server
 #    target should be UP and last-scrape-error empty. If it's
 #    showing 401, the bearer token isn't reaching the request — see
 #    troubleshooting below.
 ```
 ## Troubleshooting
 ### Prometheus target shows 401
 Three possible causes:
 1. **Wrong Secret name / key.** Run
   `kubectl get secret -n "$NAMESPACE" certctl-prometheus-key -o yaml`
   and confirm the `data.api-key` field exists with a base64-encoded
   non-empty value. The Secret's data field name must match the
   `bearerTokenSecret.key` value in `monitoring.serviceMonitor`.
 2. **API key doesn't have `metrics.read`.** Hit the gating endpoint
   manually from inside the cluster with the same key:
   ```bash
   kubectl run --rm -it --image=curlimages/curl debug -- \
     curl -sS -H "Authorization: Bearer <API_KEY>" \
     https://certctl-server.certctl.svc.cluster.local:8443/api/v1/metrics/prometheus
   ```
   Under standard HTTP semantics, a 401 here means the key itself was
   rejected (mistyped, revoked, or unknown to the server); a 403
   means the key authenticated but its actor lacks `metrics.read`.
 3. **TLS verification failure (not a 401, but masquerading as one in
   Prometheus's logs).** The default ServiceMonitor template sets
   `insecureSkipVerify: true` to support demos — production deploys
   should set `tlsConfig.caFile` or `tlsConfig.ca.secret` per the
   ServiceMonitor docs.
 ### Prometheus target shows TLS errors
 `monitoring.serviceMonitor.tlsConfig` overrides the default. Three
 patterns:
 ```yaml
 # Pattern 1: trust the system CA bundle (production behind a real CA)
 tlsConfig:
  caFile: /etc/ssl/certs/ca-certificates.crt
  serverName: certctl.your-org.example
 # Pattern 2: trust a CA from a Secret mounted by Prometheus Operator
 tlsConfig:
  ca:
    secret:
      name: certctl-ca
      key: ca.crt
  serverName: certctl.your-org.example
 # Pattern 3: skip verification (DEMO ONLY — DO NOT USE IN PRODUCTION)
 tlsConfig:
  insecureSkipVerify: true
 ```
 The certctl server's self-signed bootstrap cert (default
 `server.tls.existingSecret` from the chart) presents a CN of
 `certctl-server`. If your `serverName` doesn't match, the scrape
 fails with `x509: certificate is valid for certctl-server, not ...`.
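 Before choosing a `serverName`, it helps to read the names the
 certificate actually carries. A minimal sketch using stock `openssl`
 subcommands; the helper name `cert_names` and the `/tmp/server.crt`
 path are illustrative assumptions:
 ```bash
 # Hypothetical helper: print the subject and any SAN entries of a PEM cert.
 cert_names() {
   openssl x509 -in "$1" -noout -subject
   # A cert without a SAN extension prints nothing here; that is not an error.
   openssl x509 -in "$1" -noout -text | grep -A1 'Subject Alternative Name' || true
 }
 ```
 To capture the served certificate first, something like
 `openssl s_client -connect certctl-server.certctl.svc.cluster.local:8443 </dev/null 2>/dev/null | openssl x509 > /tmp/server.crt`
 run from inside the cluster writes it as PEM, after which
 `cert_names /tmp/server.crt` shows the names `serverName` has to match.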
 ## Rotation
 API keys are constant-time-compared, stored hashed, and never
 logged. Rotation:
 ```bash
 # 1. Mint a new key (same actor + role)
 curl -sS --cacert ./ca.crt -X POST \
  -H "Authorization: Bearer ${ADMIN_API_KEY}" \
  -H "Content-Type: application/json" \
  https://certctl.your-org.example/api/v1/auth/apikeys \
  -d '{"actor_id":"actor-prometheus","name":"prometheus-scrape-token-v2"}' \
  | tee /tmp/prom-key-new.json
 # 2. Update the Secret in place
 kubectl create secret generic certctl-prometheus-key \
  -n certctl \
  --from-literal=api-key="$(jq -r '.key_value' /tmp/prom-key-new.json)" \
  --dry-run=client -o yaml | kubectl apply -f -
 # 3. Wait one scrape interval; verify the next scrape uses the new key.
 # 4. Revoke the old key
 curl -sS --cacert ./ca.crt -X DELETE \
  -H "Authorization: Bearer ${ADMIN_API_KEY}" \
  https://certctl.your-org.example/api/v1/auth/apikeys/<OLD_KEY_ID>
 # 5. Scrub the temp file
 shred -u /tmp/prom-key-new.json
 ```
 Prometheus Operator picks up Secret changes automatically — no
 ServiceMonitor edit needed, no Prometheus restart.
 ## Related reading
 - [`docs/operator/rbac.md`](../rbac.md) — the full RBAC primitive,
  permission catalogue, and role-assignment workflow.
 - [`docs/operator/security.md`](../security.md) — the broader auth
  posture including the API key / OIDC / break-glass paths.
 - [`docs/operator/auth-threat-model.md`](../auth-threat-model.md) —
  why `/api/v1/metrics/prometheus` is gated, and what an
  unauthenticated leak of metrics data would reveal.
@@ -0,0 +1,193 @@
 # Runbook: Helm rollback for certctl
 > Last reviewed: 2026-05-14
 Use this when:
 - A `helm upgrade` rolled out a bad release and the operator wants to
  return to the previous working state.
 - A schema migration shipped a change the operator wants to back out.
 - An emergency change needs reverting and forward-fix isn't yet
  available.
 This page covers `helm rollback` mechanics + the cases where
 rollback is NOT enough on its own (schema migrations are the main
 one).
 ## What `helm rollback` does
 `helm rollback <release> [revision]` re-applies the manifests from a
 previous Helm revision. It re-creates / updates Kubernetes objects to
 match that revision's template output and is safe for:
 - **Deployment image bumps:** rolls the container image back to the
  previous tag. Pods restart with the old image.
 - **ConfigMap / Secret content changes:** old values land in the
  config; pods that consume them via `envFrom` or volume mounts get
  the prior values on the next restart.
 - **Resource requests / limits / replica count:** the spec changes
  back to the prior values. Kubernetes reschedules pods accordingly.
 - **Service / Ingress / NetworkPolicy changes:** networking flips
  back to the previous shape immediately.
 ## What `helm rollback` does NOT do
 The Kubernetes layer is reversible; the **database schema is not**.
 This is the single most common gap in a rollback plan.
 ### Schema migrations are forward-only by design
 certctl's migrations under `migrations/` are numbered up-migrations
 (`NNNNNN_*.up.sql`) with paired down-migrations
 (`NNNNNN_*.down.sql`) shipped alongside. The `postgres.RunMigrations`
 path applied at server boot only runs the `*.up.sql` files. The
 `*.down.sql` files exist for development reference + a hypothetical
 "surgical revert" path but are **not invoked by `helm rollback`**.
 The implication: if `v2.1.0 → v2.2.0` ships migrations 000100,
 000101, 000102 (adding columns, changing constraints, dropping
 indexes), then `helm rollback` to v2.1.0 takes you back to the v2.1.0
 container image — but the database still has migrations 000100-102
 applied. The v2.1.0 server code doesn't know about those columns; it
 either ignores them (best case) or fails to start (if the schema
 diverged in a way the older code can't tolerate).
 ### When is rollback safe without a schema revert?
 Migrations are **additive-only** in 90%+ of cases. The categories:
 | Migration class | Safe to roll back without schema revert? | Why |
 |---|---|---|
 | Add column with default | Yes | Old code ignores the new column |
 | Add table | Yes | Old code doesn't reference the table |
 | Add index | Yes | Old code doesn't depend on the index existing |
 | Add CHECK / FOREIGN KEY constraint | Usually yes | Fails only if the old code writes rows that violate the new constraint, which stays in the schema after rollback |
 | Rename column / table | NO | Old code's queries reference the original name |
 | Drop column / table | NO (data loss) | New code already stopped writing the column; old code expects it |
 | Type change (`VARCHAR(40)` → `TEXT`) | Usually yes | Old code's column read still works |
 | Backfill a column | Yes | Old code ignores the backfilled value |
 If your upgrade only added columns / tables / indexes, `helm
 rollback` is sufficient. If it renamed or dropped anything, you need
 a database-level revert.
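 A quick first pass over the upgrade's migrations can flag the ones
 that need a closer look. The sketch below is a heuristic only: it
 greps each `*.up.sql` in a version range for the words `DROP` or
 `RENAME`, so it also flags index drops and matches inside SQL
 comments. The helper name `destructive_migrations` is hypothetical.
 ```bash
 # Hypothetical helper: list up-migrations in [OLD, NEW] that mention DROP or RENAME.
 destructive_migrations() {
   local dir="$1" old="$2" new="$3" f ver
   for f in "$dir"/*.up.sql; do
     [ -e "$f" ] || continue          # glob matched nothing
     ver=$(basename "$f")
     ver=${ver%%_*}                   # "000101_foo.up.sql" -> "000101"
     # 10# forces base 10 so zero-padded numbers are not read as octal.
     if [ "$((10#$ver))" -ge "$((10#$old))" ] && [ "$((10#$ver))" -le "$((10#$new))" ]; then
       if grep -Eiqw 'DROP|RENAME' "$f"; then
         echo "$f"
       fi
     fi
   done
   return 0
 }
 # Example range from the text above; run from the repository root.
 destructive_migrations migrations 000100 000102
 ```
 Empty output suggests the range is additive, but read any file it
 does print (and ideally the rest) before deciding that `helm
 rollback` alone is enough.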
 ## Procedure: standard rollback (additive-only migrations)
 ```bash
 # 1. Identify the target revision
 helm history certctl -n <namespace>
 # 2. Take a backup BEFORE rolling back (defense in depth — if
 #    rollback exposes a data corruption issue, restore is the only
 #    path back)
 #    See docs/operator/runbooks/postgres-backup.md for the canonical
 #    pg_dump invocation.
 # 3. Roll back to the chosen revision
 helm rollback certctl <revision> -n <namespace> --wait --timeout 5m
 # 4. Verify
 kubectl get pods -n <namespace> -l app.kubernetes.io/instance=certctl
 kubectl logs -n <namespace> -l app.kubernetes.io/component=server --tail=50
 ```
 Watch for migration-version mismatch warnings in the server logs. If
 the older server code refuses to start because the schema is ahead
 of what it knows about, escalate to "rollback with schema revert."
 ## Procedure: rollback with schema revert
 This is the rare case. Use it when:
 - A column / table was renamed or dropped in the release being rolled back.
 - The older code refuses to start with the newer schema.
 ```bash
 # 1. Take a fresh backup right NOW (the current schema is what we're
 #    reverting from; if anything goes wrong we want a clean
 #    forward-recovery option)
 kubectl exec -n <namespace> statefulset/certctl-postgres -- \
  pg_dump --format=custom --no-owner --no-acl --dbname=certctl \
  > "certctl-pre-rollback-$(date -u +%Y%m%dT%H%M%SZ).dump"
 # 2. Stop the server Deployment to prevent it from writing to the
 #    database during the revert
 kubectl scale deploy/certctl-server -n <namespace> --replicas=0
 # 3. Apply the relevant *.down.sql files manually, one at a time, in
 #    reverse migration-number order. Example for reverting three
 #    migrations (000102, 000101, 000100):
 NEW=000102  # newest migration on the running schema
 OLD=000100  # oldest migration to revert (inclusive)
 # 10# forces base 10; bash arithmetic would otherwise read the
 # zero-padded numbers as octal.
 for N in $(seq "$((10#$NEW))" -1 "$((10#$OLD))"); do
  MIG=$(printf '%06d' "$N")
  kubectl exec -i -n <namespace> statefulset/certctl-postgres -- \
    psql --username=certctl --dbname=certctl \
    < migrations/${MIG}_*.down.sql
 done
 # 4. Manually update the schema_migrations table to reflect the
 #    reverted state (the migration runner's bookkeeping)
 kubectl exec -n <namespace> statefulset/certctl-postgres -- \
  psql --username=certctl --dbname=certctl -c \
  "DELETE FROM schema_migrations WHERE version >= $((10#$OLD));"
 # 5. NOW run helm rollback. The server pod will start with a schema
 #    that matches its code.
 helm rollback certctl <revision> -n <namespace> --wait --timeout 5m
 ```
 The `*.down.sql` files are tested but only against pristine schemas —
 they may not handle every data shape a production database
 accumulates. ALWAYS take a backup first; the down-migrations are
 a recovery tool, not a transactional contract.
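 One shell detail matters when doing the bookkeeping arithmetic by
 hand: migration numbers are zero-padded, and bash arithmetic treats
 a leading zero as an octal prefix. The short demonstration below
 uses the example migration number from this runbook:
 ```bash
 OLD=000100
 # Leading zeros make bash parse the value as octal: 0100 is 64.
 echo "$((OLD - 1))"       # prints 63
 # The 10# prefix forces base 10.
 echo "$((10#$OLD - 1))"   # prints 99
 ```
 Without the `10#` prefix, a `DELETE ... WHERE version > N` built
 from this arithmetic would remove far more `schema_migrations` rows
 than intended.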
 ## Procedure: full restore (when revert isn't tractable)
 When a down-migration would lose data (drop columns / tables that
 hold rows the older code can't read but the newer code populated), a
 full restore is the only safe path. This is the procedure described
 in
 [`docs/operator/runbooks/disaster-recovery.md`](disaster-recovery.md#postgres-restore).
 The summary:
 1. Stop certctl.
 2. Take a backup of the CURRENT schema (defense in depth).
 3. Restore the LAST backup taken BEFORE the bad upgrade.
 4. Roll the Helm release back to the matching code version.
 5. Restart certctl.
 6. Re-run any audited writes that happened in the window between the
   backup and the bad upgrade (read the audit log; the API surface
   is recoverable).
 The DR runbook owns the canonical commands.
 ## Common pitfalls
 - **Forgetting the backup before rollback.** A schema-revert path is
  not safe without a fresh backup. If something goes wrong mid-revert
  and your most recent backup is from last night, you've lost any
  cert-issuance history between then and now.
 - **Rolling back the chart without rolling back the database state**
  on a release that included a destructive migration (drop column,
  drop table). Symptoms: old code starts, queries fail with
  "column does not exist," server crashes in a loop. Recovery
  requires schema revert OR full restore.
 - **Letting the agents drift.** `helm rollback` updates the agent
  DaemonSet's image too — agents on different versions than the
  server may produce incompatible CSR payloads. After rollback,
  confirm agent images are at the matching version via
  `kubectl get daemonset certctl-agent -o jsonpath='{.spec.template.spec.containers[0].image}'`.
 - **GHCR images pinned by digest:** the rollback restores the prior
  `image:` value from the Helm template. If your operator workflow
  uses `image.digest` pinning, the digest comes back too — make
  sure that digest still exists on ghcr.io. Digests there do persist (old
  tags are never deleted), but a private mirror may have garbage-collected them.
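The agent-drift check in the list above can be scripted by comparing image tags. This is a minimal sketch: the helper names `image_tag` and `same_image_tag` and the example image references are illustrative assumptions, not part of certctl. Feed it the output of the `kubectl get daemonset ... -o jsonpath` command above and the equivalent query for the server workload.

```sh
# image_tag prints the tag of an image reference, or "latest" if none is given.
# Hypothetical helper: tolerates registry ports and strips any @sha256 digest.
image_tag() {
  ref="${1%%@*}"          # drop a trailing @sha256:... digest
  name="${ref##*/}"       # last path component, e.g. "certctl-agent:v2.1.3"
  case "$name" in
    *:*) printf '%s\n' "${name##*:}" ;;
    *)   printf 'latest\n' ;;
  esac
}

# same_image_tag succeeds when both references carry the same tag.
same_image_tag() {
  [ "$(image_tag "$1")" = "$(image_tag "$2")" ]
}

# Example with made-up image references:
if same_image_tag "ghcr.io/example/certctl-server:v2.1.3" \
                  "ghcr.io/example/certctl-agent:v2.1.3"; then
  echo "server and agent tags match"
else
  echo "agent drift: tags differ"
fi
```

Comparing tags only is a deliberate simplification; a digest-pinned deployment would compare digests instead.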
 ## Related reading
 - [`docs/operator/runbooks/postgres-backup.md`](postgres-backup.md) —
  the backup procedure that's the precondition for any
  schema-revert path.
 - [`docs/operator/runbooks/disaster-recovery.md`](disaster-recovery.md) —
  the full restore procedure when rollback isn't tractable.
 - [`docs/migration/api-keys-to-rbac.md`](../../migration/api-keys-to-rbac.md) —
  example of a migration that the runtime supports rolling back via
  feature flag (rare).
@@ -0,0 +1,123 @@
 # Scale baseline — 2026 Q2 canonical-hardware capture
 > Last reviewed: 2026-05-16
 ## What this file is
 The canonical record of certctl's load-test baselines for the
 2026-Q2 reporting window. TEST-005 closure (Sprint 5, 2026-05-16)
 introduces this doc as the single source of truth for "what's the
 scale ceiling?" — replacing the TBD-laden table at
 [`docs/operator/scale.md`](scale.md#measured-baseline) that had been
 unfilled since the scenarios shipped in Phase 8.
 The numbers below come from the `loadtest` GitHub Actions workflow
 running its three canonical scenarios on `ubuntu-latest` runners:
 - `bulk-renewal` — 10,000-cert seed + criteria-mode
  `POST /api/v1/certificates/bulk-renew`, 200 concurrent VUs over 10
  minutes.
 - `acme-burst` — 200 concurrent VUs hitting `/acme/directory`,
  `/acme/new-nonce`, and `/acme/renewal-info/<cert-id>` simultaneously.
 - `agent-storm` — 5,000-agent seed + sustained
  `POST /api/v1/agents/{id}/heartbeat` at 167 RPS.
 Thresholds are enforced inline in `deploy/test/loadtest/k6.js` (p99 < 5s
 for issuance-acceptance, p99 < 2s for list, error rate < 1%). k6 exits
 non-zero on any breach, which propagates through `docker compose up
 --exit-code-from k6 → make loadtest → workflow exit`.
 ## Capture procedure
 1. Trigger the workflow:
   - **Actions** → `loadtest` → **Run workflow**, branch `master`.
   - Wait ~25 minutes for the three matrix legs to finish.
 2. Download each scenario's artifact from the workflow run page:
   - `k6-scale-bulk-renewal-<run-id>`
   - `k6-scale-acme-burst-<run-id>`
   - `k6-scale-agent-storm-<run-id>`
   - Each archive contains the k6 `summary.json` + raw NDJSON
     points (90-day GHA retention).
 3. Run `scripts/scale-baseline/extract.sh <run-id>` to pull the
   three artifacts and emit the table rows for this doc.
 4. Paste the rows under the **Latest capture** section. Update
   `> Last reviewed:` to today.
 5. Commit the artifacts you want long-term-retained to
   [`deploy/test/loadtest-artifacts/`](../../deploy/test/loadtest-artifacts/)
   using `git lfs` if the archives exceed 100 MB; otherwise commit
   them inline.
 ## Latest capture
 | Scenario | Run ID | Date | p50 | p95 | p99 | Error rate | Peak server RSS | Notes |
 |---|---|---|---|---|---|---|---|---|
 | **bulk-renewal** | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | First post-TEST-005 capture; trigger via workflow_dispatch + extract via the procedure above. |
 | **acme-burst** directory | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | — |
 | **acme-burst** new-nonce | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | — |
 | **acme-burst** renewal-info | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | — |
 | **agent-storm** | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | — |
 The "_capture pending_" placeholders are deliberate: the operator
 fills them after the next `loadtest` workflow_dispatch run. Replace
 the placeholder rows with that first capture; on later runs, add new
 rows instead of editing existing ones in place, so each historical
 row stays as evidence.
 ## Why "ubuntu-latest" instead of RDS-shaped hardware
 The audit's fix language preferred RDS-shaped Postgres on a
 fixed-spec runner. ubuntu-latest's 2-vCPU / 7-GB-RAM shape is
 narrower than typical production Postgres, but it has two virtues:
 1. **Reproducibility.** Every operator + acquirer can reproduce the
   numbers; an RDS-shaped Postgres requires a paid AWS account.
 2. **Conservative ceiling.** If the published numbers come from a
   constrained runner, real-world deployments on production Postgres
   sizes (db.m5.large and up) should only do better.
 When an acquirer or operator asks for a production-equivalent
 baseline, capture a second run on whatever infrastructure they want
 to validate against and add it under a new **2026 Q3 capture**
 section.
 ## Methodology
 ### Hardware
 - **Runner:** GitHub Actions `ubuntu-latest` (currently Ubuntu 24.04, 2-vCPU, 7-GB RAM).
 - **certctl image:** built from the same commit the workflow runs on.
 - **Postgres:** `postgres:16-alpine@sha256:890480b08124ce7f79960a9bb16fe39729aa302bd384bfd7c408fee6c8f7adb7`, in-cluster, default config (no operator tuning).
 - **Network:** runner localhost.
 ### Software
 - **k6:** version pinned in `deploy/test/loadtest/Dockerfile`.
 - **certctl tag:** the v* tag at workflow trigger time (matches `openapi.yaml info.version`).
 ### Metrics captured
 - **p50 / p95 / p99 latency** — k6's `http_req_duration` percentiles.
 - **Error rate** — k6 `http_req_failed` rate (non-2xx + connection errors).
 - **Peak server RSS** — `docker stats` polled at 1-Hz for the
  duration of the run; `max(memory_stats.usage)` taken from the
  emitted JSON.
 - **Acceptance gate** — the k6 thresholds in `k6.js`; if exceeded
  the workflow fails.
 ### What's NOT captured
 - **Cold-start latency** — these are steady-state baselines after the
  k6 warmup ramp. Cold-start is a separate concern (renewal-loop
  startup, scheduler tick boundary), not covered by these scenarios.
 - **WAN latency** — runs are localhost; production-WAN-RTT additions
  fall outside scope.
 - **Federation overhead** — single-instance only; HA + replicas runs
  are a future deliverable.
 ## Related reading
 - [`docs/operator/scale.md`](scale.md) — the operator-facing scale
  posture doc; baseline rows there point at this file.
 - [`deploy/test/loadtest/README.md`](../../deploy/test/loadtest/README.md) —
  scenario semantics + how to read the k6 output.
 - [`deploy/test/loadtest-artifacts/`](../../deploy/test/loadtest-artifacts/) —
  long-term archive of captured k6 results.
@@ -0,0 +1,244 @@
 # Operator scale guide
 > Last reviewed: 2026-05-16
 Use this when:
 - You're sizing a new certctl deployment for a target fleet count.
 - You're scaling an existing deployment up from demo (15 certs / 1
  agent) to production (1K+ certs / 100+ agents).
 - An auditor asks "what does this scale to?" and you want a documented
  answer that isn't "we haven't measured."
 ## DB connection pool
 certctl's PostgreSQL connection pool is the single largest scale lever.
 Pool exhaustion looks like 503s + agent poll timeouts + scheduler
 falling behind on its loops. The default ships at 50 max open
 connections (`CERTCTL_DATABASE_MAX_CONNS=50`), with idle = max/5 = 10
 under the existing `internal/repository/postgres/db.go::NewDBWithMaxConns`
 contract.
 Operator-tune ladder:
 | Fleet size                  | `CERTCTL_DATABASE_MAX_CONNS` | Postgres `max_connections` | Notes |
 |---|---|---|---|
 | ≤ 500 certs / 100 agents    | `50` (default)               | `100` (PG default)         | Demo + small deployments. Pool default sized for this. |
 | 5K certs / 1K agents        | `100`                        | `200`                      | Postgres needs an explicit bump from the 100 default; restart required (`max_connections` cannot be changed by a reload). |
 | 50K certs / 10K agents      | `200`                        | `400`                      | Plus dedicated Postgres VM (separate from server host); `shared_buffers` ≥ 1 GiB. |
 Always leave headroom in Postgres's `max_connections` for backups
 (`pg_dump` opens its own connection), ad-hoc psql sessions, and
 replicas. The formula `(server pool size × replicas) + 20` is a safe
 floor for Postgres's `max_connections`.
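As a worked example of the floor above, the sketch below evaluates `(server pool size × replicas) + 20`. The function name `pg_max_connections_floor` is hypothetical, and the sketch assumes "replicas" counts certctl server instances that each hold their own pool.

```sh
# pg_max_connections_floor POOL_SIZE REPLICAS
# Prints (pool size x replicas) + 20, the floor described above.
pg_max_connections_floor() {
  echo $(( $1 * $2 + 20 ))
}

pg_max_connections_floor 50 1    # default pool, one server
pg_max_connections_floor 100 2   # 5K-cert tier pool, two servers
```

The first call prints 70, which fits under the Postgres default of 100. The second prints 220, which is above the 200 listed in the ladder, so a two-server deployment at that tier would need a higher `max_connections` than the table's starting point.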
 **Numbers above the small-fleet row are operator-tuning starting
 points, not validated ceilings.** Phase 8 of the architecture diligence
 remediation will replace these with measured values from synthetic
 fleets; until then, capture your own observations in a loadtest log
 and tune against them.
 ## Scheduler tick budgets
 certctl has 15 scheduler loops, each with its own cadence
 (`internal/scheduler/scheduler.go`). The renewal scan is the hottest
 loop on large fleets: it pulls every managed certificate, applies
 each profile's renewal policy, and dispatches an issuance job per
 cert that meets the threshold. The default cadence is `1h`
 (`CERTCTL_SCHEDULER_RENEWAL_CHECK_INTERVAL`).
 Phase 6 SCALE-M5 closure (2026-05-14) added per-ticker jitter via the
 `internal/scheduler.JitteredTicker` wrapper. Each loop's interval is
 unchanged; the wrapper adds ±10% randomized delay per tick so multiple
 loops with the same nominal cadence don't co-fire and cause
 hour-boundary CPU + DB spikes. For most fleets the visible effect is a
 smoother CPU graph during the renewal scan.
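To get a feel for what ±10% means in wall-clock terms, the sketch below computes the earliest and latest spacing between ticks for a given nominal interval. It assumes the jitter is applied symmetrically around the nominal interval; the real `JitteredTicker` may distribute the delay differently, and `jitter_bounds` is a hypothetical name.

```sh
# jitter_bounds INTERVAL_SECONDS
# Prints the shortest and longest tick spacing for a +/-10% jitter window.
jitter_bounds() {
  interval="$1"
  spread=$(( interval / 10 ))
  echo "$(( interval - spread )) $(( interval + spread ))"
}

jitter_bounds 3600   # the default 1h renewal cadence
```

For the default 1h cadence this prints `3240 3960`, so under this assumption two loops that nominally fire together can land up to 12 minutes apart.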
 **Renewal-sweep semaphore (SCALE-L1).** The renewal loop dispatches
 concurrent issuance work behind a per-tick semaphore (default
 `CERTCTL_RENEWAL_CONCURRENCY=25`). Under tick-budget pressure (a tick
 that exceeds the loop interval), the semaphore can hold the entire
 concurrency cap until the context cancels at next-tick boundary —
 which is intentional. The drain happens via context cancellation; new
 work isn't started past the deadline. Tests in
 `internal/scheduler/` pin this drain behavior. Operators on large
 fleets should:
 1. Bump `CERTCTL_RENEWAL_CONCURRENCY` to 50 or 100 if the renewal scan
   consistently exceeds tick budget.
 2. Also bump `CERTCTL_DATABASE_MAX_CONNS` proportionally — each
   concurrent renewal task opens its own pool connection during
   issuance / deployment.
 3. Watch for the "renewal scan complete" log line per tick. If it's
   consistently late, you're under-provisioned.
 ## Async CA polling budgets (SCALE-M3)
 DigiCert, Entrust, GlobalSign, and Sectigo are async issuers — they
 accept a CSR, queue it on the CA side, and return a polling token.
 The certctl server polls the CA's status endpoint until the cert is
 ready or the deadline expires. The default poll-deadline is 10
 minutes wall-clock (`asyncpoll.DefaultMaxWait`); after that the
 issuance returns `StillPending` and the scheduler re-enqueues the
 job for the next tick.
 Priority chain when picking the actual deadline (highest → lowest):
 1. Per-connector env: `CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS`,
   `CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS`,
   `CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS`,
   `CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS`.
 2. Global env: `CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS` (sets the
   process-wide default for all async-CA connectors that didn't set
   their per-connector value).
 3. Package const: `asyncpoll.DefaultMaxWait = 10 * time.Minute`.
 Operators with slow async CAs (Entrust certificate-mode in
 particular can take 15-30 minutes during business hours) should
 raise the per-connector value rather than the global; that way fast
 issuers don't pay the polling cost.
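The priority chain above can be sketched as a small lookup. This is an operator-side illustration only: the server resolves the value in Go, the function name `resolve_poll_max_wait` is hypothetical, and the final fallback of 600 seconds simply mirrors `asyncpoll.DefaultMaxWait = 10 * time.Minute`.

```sh
# resolve_poll_max_wait CONNECTOR
# CONNECTOR is one of DIGICERT, ENTRUST, GLOBALSIGN, SECTIGO.
# Prints the effective poll deadline in seconds, highest priority first.
resolve_poll_max_wait() {
  per_var="CERTCTL_${1}_POLL_MAX_WAIT_SECONDS"
  eval "per_val=\${${per_var}:-}"          # read the per-connector variable
  if [ -n "$per_val" ]; then
    echo "$per_val"                        # 1. per-connector env
  elif [ -n "${CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS:-}" ]; then
    echo "$CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS"   # 2. global env
  else
    echo 600                               # 3. package default (10 minutes)
  fi
}

# Example: raise only the slow issuer, leave a global default for the rest.
CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS=900
CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS=1800
resolve_poll_max_wait ENTRUST    # per-connector value wins
resolve_poll_max_wait DIGICERT   # falls back to the global value
```

Setting only the Entrust variable keeps the longer deadline scoped to the issuer that needs it, which is the recommendation in the paragraph above.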
 ## Cursor pagination caching (SCALE-L2)
 Phase 6 SCALE-L2 closure (2026-05-14) added an ETag middleware at
 `internal/api/middleware/etag.go` covering the top-5 read endpoints:
 `/api/v1/certificates`, `/api/v1/jobs`, `/api/v1/agents`,
 `/api/v1/audit`, `/api/v1/discovery/certificates`. The ETag is
 derived from `(max-row-updated-at, row-count)` for the requested
 filter; repeated requests with the same query return `304 Not
 Modified` when the underlying data hasn't changed. The dashboard
 benefits most — its polling loop on the certificates page is the
 single largest read-traffic source on most deployments.
 When the cache is effective, repeated reads bypass the
 `SELECT COUNT(*) FROM <table>` query entirely. The cache invalidates
 on any mutation to the table (the row-count + max-updated-at hash
 flips).
 Operators don't need to do anything to opt in — the middleware is
 wired around the top-5 endpoints unconditionally. If you want to
 verify it's working, check the `ETag:` response header on a list
 endpoint and repeat the request with the same value in an
 `If-None-Match:` header — the second request should return 304 with
 an empty body.
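The validator logic described in this section can be sketched in a few lines. This is not the middleware's implementation: `etag_for` and `status_for` are hypothetical names, and `cksum` is only a stand-in for whatever hash `etag.go` actually uses.

```sh
# etag_for MAX_UPDATED_AT ROW_COUNT
# Derives a quoted validator from the two inputs named above.
etag_for() {
  sum=$(printf '%s:%s' "$1" "$2" | cksum | cut -d' ' -f1)
  printf '"%s"\n' "$sum"
}

# status_for IF_NONE_MATCH CURRENT_ETAG
# Prints 304 when the client's validator still matches, otherwise 200.
status_for() {
  if [ "$1" = "$2" ]; then echo 304; else echo 200; fi
}

before=$(etag_for "2026-05-16T10:00:00Z" 1500)
after=$(etag_for "2026-05-16T10:00:00Z" 1499)   # an older row was deleted
status_for "$before" "$before"   # data unchanged since the cached response
status_for "$before" "$after"    # row count changed, validator no longer matches
```

The delete case shows why the row count is part of the validator: removing a row that is not the newest leaves the maximum `updated_at` untouched, so the timestamp alone would miss the change.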
 ## Scale-tier scenarios (SCALE-H2, Phase 8)
 Phase 8 (2026-05-14) extended the k6 load-test harness with three new
 scenarios that exercise the scale-relevant load surfaces the original
 API tier left uncovered. They live behind a compose profile gate
 (`docker compose --profile scale`) so the default `make loadtest`
 stays focused on per-PR regression scope. The full set runs weekly on
 the same `loadtest.yml` cron as the API + connector tier.
 | Scenario | k6 file | Seed fixture | Sustained load |
 |---|---|---|---|
 | Bulk-renewal under load | `deploy/test/loadtest/k6/bulk_renewal.js` | 10,000 managed_certificates (`seed/01_bulk_renewal_certs.sql`) | 5 req/s POST `/api/v1/certificates/bulk-renew` × 5 min |
 | ACME enrollment burst | `deploy/test/loadtest/k6/acme_burst.js` | (none — unauth surface) | 200 concurrent VUs × directory/nonce/ARI × 5 min |
 | Agent heartbeat storm | `deploy/test/loadtest/k6/agent_storm.js` | 5,000 agents (`seed/02_agent_fleet.sql`) | 167 req/s POST `/api/v1/agents/{id}/heartbeat` × 5 min |
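The 167 req/s figure in the agent-storm row is consistent with each of the 5,000 seeded agents sending one heartbeat roughly every 30 seconds. That interval is an inference from the two numbers in the table, not something this page states, and `fleet_rps` is a hypothetical helper for sizing a different fleet the same way.

```sh
# fleet_rps AGENTS INTERVAL_SECONDS
# Prints the aggregate heartbeat rate, rounded to the nearest whole request.
fleet_rps() {
  echo $(( ($1 + $2 / 2) / $2 ))
}

fleet_rps 5000 30   # the seeded fleet, assuming a 30s heartbeat interval
```

Under that assumption the call prints 167, matching the sustained load in the table.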
 ### Threshold contracts (regression guards, NOT measured baselines)
 | Scenario | Metric | Threshold |
 |---|---|---|
 | Bulk-renewal | `http_req_duration{scenario:bulk_renewal}` p99 | < 5 s |
 | Bulk-renewal | `http_req_duration{scenario:bulk_renewal}` p95 | < 2 s |
 | Bulk-renewal | `http_req_failed{scenario:bulk_renewal}` | < 1% |
 | ACME burst | `acme_directory_duration` p95 | < 500 ms |
 | ACME burst | `acme_new_nonce_duration` p95 | < 300 ms |
 | ACME burst | `acme_renewal_info_duration` p95 | < 800 ms |
 | ACME burst | `http_req_failed{server_error:true}` 5xx-only | < 0.1% |
 | Agent storm | `http_req_duration{scenario:agent_storm}` p99 | < 1 s |
 | Agent storm | `http_req_duration{scenario:agent_storm}` p95 | < 500 ms |
 | Agent storm | `http_req_failed{scenario:agent_storm}` | < 0.1% |
 429 rate-limit responses on the ACME burst are EXPECTED — Phase 5's
 per-account rate limiter SHOULD fire at sustained 200-VU pressure.
 The custom `acme_rate_limited_count` Counter tracks how often it
 fires; `acme_rate_limit_shape_ok` Counter verifies every 429 returns
 the RFC 7807 `application/problem+json` shape with the
 `urn:ietf:params:acme:error:rateLimited` type. A regression that
 returned plain-text 429 or a different problem type would surface as
 `(rate_limited_count - shape_ok_count) > 0` in the summary.
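The regression check in the last sentence is a plain subtraction over the two counters. A minimal sketch, assuming you have already read both counter values out of the k6 summary (the function name and the sample values are made up for illustration):

```sh
# acme_429_shape_regressions RATE_LIMITED_COUNT SHAPE_OK_COUNT
# Prints how many 429 responses lacked the expected problem+json shape.
acme_429_shape_regressions() {
  echo $(( $1 - $2 ))
}

# Made-up counter values for illustration:
bad=$(acme_429_shape_regressions 1840 1840)
if [ "$bad" -gt 0 ]; then
  echo "regression: $bad malformed 429 responses"
else
  echo "all 429 responses had the expected shape"
fi
```

A high `acme_rate_limited_count` on its own is not a failure here; only a gap between the two counters is.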
 ### Measured baseline
 TEST-005 closure (Sprint 5, 2026-05-16) moved the baseline table out
 of this file into its own canonical record:
 [`docs/operator/scale-baseline-2026-Q2.md`](scale-baseline-2026-Q2.md).
 That doc owns the capture procedure, the methodology, and the
 per-scenario rows; this page links to it as the authoritative
 source.
 The split exists because the baseline table is mutable on every
 loadtest workflow_dispatch run, while this page (the operator-facing
 scale posture doc) changes only when the underlying scenarios or
 thresholds change. Keeping them in separate files avoids
 review-noise on per-capture commits.
 Long-term k6 NDJSON artifacts beyond GHA's 90-day retention live at
 [`deploy/test/loadtest-artifacts/`](../../deploy/test/loadtest-artifacts/).
 ### How to run the scale tier locally
 ```sh
 # All three scenarios serially (~18 min total):
 make loadtest-scale
 # Individual scenarios (each ~6 min):
 make loadtest-scale-bulk     # 10K cert bulk-renew
 make loadtest-scale-acme     # 200 VU ACME burst
 make loadtest-scale-agent    # 5K agent heartbeat storm
 ```
 Each scenario boots its own copy of the loadtest compose stack
 (postgres + tls-init + certctl-server) plus the `scale-seed` init
 container that runs the SQL fixtures from `deploy/test/loadtest/seed/`.
 The seed is idempotent (`ON CONFLICT … DO NOTHING`) so re-running a
 scenario against the same compose stack is cheap.
 ### Documented limitations of the scale tier
 - **JWS-signed ACME flows are not measured.** The ACME burst scenario
  hits the unauthenticated directory + new-nonce + ARI surface only.
  Measuring the JWS-signed POST hot path (new-account / new-order /
  finalize) requires bundling a JWS signer into the k6 driver (k6
  doesn't ship JWS). End-to-end JWS conformance is gated by
  `make acme-rfc-conformance-test` which drives `lego` against the
  same stack.
 - **Scheduler renewal scan throughput.** The bulk-renewal scenario
  measures the inbound POST throughput; the scheduler's
  `jobProcessorLoop` drains the enqueued jobs at a fixed per-tick
  budget (`CERTCTL_RENEWAL_CONCURRENCY=25` default), and the
  throughput of that path is not amplified by adding more inbound
  bulk-renew calls. A future scenario could pull
  `/api/v1/jobs?status=pending` and measure drain time.
 - **Production-sized Postgres.** The compose stack runs
  `postgres:16-alpine` with default config on a CI runner.
  Production deploys with `shared_buffers >= 1 GiB` + dedicated
  Postgres VM will have different query plans for the 10K-cert
  scan. The captured numbers translate directionally but the
  absolute ceiling is workload-specific — see the operator-tune
  ladder above for production sizing.
 - **Pull-only deployment model.** Agent CSR submit, work-poll, and
  deploy-verify paths are intentionally out of scope. The heartbeat
  storm exercises the highest-frequency call on a typical fleet;
  the work-poll path runs at the same cadence but is cheap (empty
  set returned 99% of the time).
 ## Profiling production
 When the above ladder doesn't fit your shape, profile against your
 specific workload. The
 [performance-baselines.md](performance-baselines.md) runbook has
 single-endpoint, inventory-walk, and renewal-scan recipes you can
 adapt.
 ## Related reading
 - [`docs/operator/performance-baselines.md`](performance-baselines.md) —
  per-endpoint baselines + how to re-baseline after upgrades.
 - [`docs/operator/runbooks/postgres-backup.md`](runbooks/postgres-backup.md) —
  Postgres-side backup discipline (necessary precondition for any
  scale tuning).
 - [`deploy/ENVIRONMENTS.md`](../../deploy/ENVIRONMENTS.md) — the
  full env-var inventory the values referenced above come from.
@@ -58,7 +58,55 @@ For certificates issued to systems where revocation correctness matters:
 ## Postgres transport encryption
-See [docs/database-tls.md](database-tls.md).
+**Audit references:** SEC-013 (advisory) and SEC-014 (host-port bind),
 both closed in Sprint 2 of the 2026-Q2 acquisition audit
 (2026-05-16).
 The full upgrade procedure (sslmode flags, CA bundle paths, Helm chart
 values, AWS RDS / Google Cloud SQL / Azure Database notes) lives at
 [docs/operator/database-tls.md](database-tls.md). The summary of the
 two operator-visible defenses certctl ships:
 ### SEC-014 — Postgres host port is loopback-only
 `deploy/docker-compose.yml` and `deploy/docker-compose.test.yml` both
 publish Postgres on `127.0.0.1:5432:5432` rather than `5432:5432`.
 The default Docker port-binding behavior is to bind to `0.0.0.0`,
 which exposes Postgres on every interface of the host — including any
 public-facing NICs the operator did not realize were attached. The
 loopback bind closes that footgun without breaking the
 certctl-server hop (which goes over the `certctl-network` Docker
 bridge, not over the host port).
 Operators who genuinely need to reach Postgres from another host —
 e.g. a separate metrics box running `postgres_exporter` — should
 either (1) attach that host into the same Docker network, (2) tunnel
 through SSH (`ssh -L`), or (3) re-publish the port on an explicit
 bind address (for example `host_ip:` in the Compose long `ports:`
 syntax) with a documented network-layer access control.
 Loosening the loopback bind without one of those is a regression.
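When reviewing a compose file against this rule, the thing to look for is the host-IP prefix on the published port. The helper below is a hypothetical sketch that only understands the short `HOST_IP:HOST_PORT:CONTAINER_PORT` string form; it is not one of the repository's CI guards.

```sh
# is_loopback_publish MAPPING
# Succeeds when a short-form compose port mapping binds to loopback only.
is_loopback_publish() {
  case "$1" in
    127.0.0.1:*:*|"[::1]":*:*) return 0 ;;
    *) return 1 ;;
  esac
}

for mapping in "127.0.0.1:5432:5432" "5432:5432"; do
  if is_loopback_publish "$mapping"; then
    echo "$mapping: loopback-only"
  else
    echo "$mapping: binds beyond loopback"
  fi
done
```

A mapping without a host-IP prefix is reported as binding beyond loopback because Docker's default for that form is `0.0.0.0`.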
 ### SEC-013 — advisory WARN on external `sslmode=disable`
 `internal/config/config.go::Validate` emits an `slog.Warn` (NOT a
 fail-closed error) when `CERTCTL_DATABASE_URL` parses as a Postgres
 URL with `sslmode=disable` AND the host is outside the local
 safelist (`localhost` / `127.0.0.1` / `::1` / `postgres` /
 `certctl-postgres` / `*.svc.cluster.local`). The advisory exists
 because the legitimate compose / Helm topology genuinely uses
 `sslmode=disable` over the Docker bridge — failing closed would
 break the production-shaped quickstart — but pointing
 `CERTCTL_DATABASE_URL` at a managed-Postgres host (RDS, Cloud SQL,
 Azure Database) without flipping `sslmode` to `verify-full` puts
 the entire control plane's Postgres traffic on the wire in
 cleartext. The WARN surfaces that landmine on every boot so the
 operator notices it in the journal even if the rest of the boot
 sequence looks healthy.
 To clear the WARN: set `CERTCTL_DATABASE_URL` to a URL with
 `sslmode=verify-full` and `sslrootcert=<ca-bundle-path>`. The full
 procedure (CA-bundle materialization, Helm chart values,
 secret-manager wiring) is in
 [docs/operator/database-tls.md](database-tls.md).
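To predict whether a given `CERTCTL_DATABASE_URL` will trigger the WARN before deploying it, the condition described above can be approximated in shell. This is a sketch, not the Go check in `Validate`: the function name is hypothetical, the URL parsing is deliberately simple, and the safelist is copied from the paragraph above.

```sh
# needs_sslmode_warn URL
# Succeeds when the URL uses sslmode=disable AND the host is not safelisted.
needs_sslmode_warn() {
  url="$1"
  case "$url" in
    *[?\&]sslmode=disable|*[?\&]sslmode=disable\&*) ;;
    *) return 1 ;;                      # TLS mode is not "disable"
  esac
  rest="${url#*://}"                    # strip the scheme
  authority="${rest%%[/?]*}"            # keep user:pass@host:port
  hostport="${authority##*@}"           # drop credentials
  case "$hostport" in
    "["*) host="${hostport#\[}"; host="${host%%\]*}" ;;   # bracketed IPv6
    *)    host="${hostport%%:*}" ;;
  esac
  case "$host" in
    localhost|127.0.0.1|::1|postgres|certctl-postgres|*.svc.cluster.local)
      return 1 ;;                       # local safelist, no WARN
    *) return 0 ;;
  esac
}

if needs_sslmode_warn "postgres://u:p@db.example.com:5432/certctl?sslmode=disable"; then
  echo "expect the SEC-013 WARN on boot"
fi
```

The example host `db.example.com` is made up; any host outside the safelist behaves the same way in this sketch.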
 ## Encryption at rest
@@ -1,6 +1,6 @@
 # Architecture Guide
-> Last reviewed: 2026-05-05
+> Last reviewed: 2026-05-16
 ## Contents
@@ -55,6 +55,45 @@ New to certificates? Read the [Concepts Guide](concepts.md) first.
 7. **Connector Architecture** — Pluggable issuers, targets, and notifiers for extensibility
 8. **Self-Hosted** — No cloud lock-in; run with Docker Compose, Kubernetes, or bare metal
 ### Single-tenant deployment model
 certctl runs as a **single-tenant** application today. Every authenticated
 request is stamped with `auth.DefaultTenantID` by the auth middleware
 (`internal/auth/middleware.go` — the `TenantIDKey` context value is
 constant for the process lifetime), and repository queries don't filter
 on tenant. A deploy is one tenant; a buyer running multiple business
 units on one cluster needs one certctl deployment per business unit.
 The `tenant_id` columns sprinkled across the schema (`actors`,
 `managed_certificates`, `agents`, `users`, `roles`, audit log, etc.) are
 **forward-compatible scaffolding** for the multi-tenancy roadmap item
 in `WORKSPACE-ROADMAP.md`, not active multi-tenant code. A repo skimmer
 who sees the columns can reasonably assume tenant isolation is wired
 end-to-end; it isn't. The `scripts/ci-guards/multi-tenant-query-coverage.sh`
 guard exists to track the drift and is treated as informational (warns
 on net-new tenant_id-less queries above a baseline) — flipping it to a
 hard gate is the inflection-point work for activating multi-tenancy.
 Lifting this to a multi-tenant deployment requires three pieces of
 work in sequence:
 1. **Request-derived tenant resolution.** Replace the constant
   `DefaultTenantID` stamp with a resolution function that picks
   the tenant from the actor (`actors.tenant_id`) or a hostname /
   path-prefix routing convention.
 2. **Per-query tenant scoping.** Every query that touches a
   `tenant_id`-bearing table must add `AND tenant_id = $N` to its
   `WHERE` clause.
   The multi-tenant-query-coverage guard tracks this surface;
   activating multi-tenancy means driving its baseline to zero.
 3. **Per-tenant resource quotas + isolation tests.** RBAC scope
   types extend with `tenant`; integration tests exercise
   cross-tenant data-leak prevention; quotas (certs/issuers/agents
   per tenant) wire into the existing limit-enforcement layer.
 Until that work lands, **the multi-tenant columns are decorative**.
 Treat them as you would a Postgres `version` column on a row you
 never read — the schema is forward-compat, the runtime is not.
 ## System Components
 ```mermaid
@@ -1,6 +1,6 @@
 # Connector Development Guide
-> Last reviewed: 2026-05-05
+> Last reviewed: 2026-05-16
 >
 > This is the canonical connector reference: interface contracts,
 > registry, deployment primitive, network scanner, cloud discovery.
@@ -41,13 +41,23 @@ Target connectors:
 - [HAProxy](haproxy.md) — combined-PEM deploy + `haproxy -c` validate
 - [IIS](iis.md) — Microsoft IIS, local PowerShell + WinRM modes
 - [Java Keystore](jks.md) — JKS / PKCS#12 via `keytool` with atomic snapshot rollback
 - [Kubernetes Secrets](k8s.md) — k8s.io/tls Secrets atomic update
 - [NGINX](nginx.md) — separate-file deploy + `nginx -t` validate
 - [Postfix / Dovecot](postfix.md) — dual-mode mail-server TLS connector
 - [SSH (agentless)](ssh.md) — agentless deploy over SSH/SFTP for Linux/Unix targets
 - [Traefik](traefik.md) — file-provider zero-reload deploy
 - [Windows Certificate Store](wincertstore.md) — non-IIS Windows services (Exchange, RDP, SQL, ADFS)
 ### Preview connectors (not in the production-ready set)
 SEC-003-K8S closure (Sprint 4, 2026-05-16) moved Kubernetes Secrets
 out of the canonical fourteen-target index because the production
 client-go integration is not yet wired — the connector ships but
 refuses to register without `CERTCTL_K8SSECRET_PREVIEW_ACK=true`
 and the CRUD methods return *"real Kubernetes client not
 implemented"* until the integration lands.
 - [Kubernetes Secrets](k8s.md) — **preview** — k8s.io/tls Secrets atomic update. See [`docs/reference/deployment-model.md`](../deployment-model.md) row `k8ssecret` for the bundle-2 V2-blocker scope.
 ## Contents
 1. [Overview](#overview)
@@ -109,7 +119,7 @@ Target connectors:
 Three types of connectors:
 1. **Issuer Connector** — Obtains certificates from CAs. 12 built-in: Local CA (self-signed + sub-CA + tree mode; ADCS sub-CA mode is documented separately), ACME v2 (HTTP-01, DNS-01, DNS-PERSIST-01, ARI, EAB, profile selection), step-ca, OpenSSL/Custom CA, Vault PKI, DigiCert CertCentral, Sectigo SCM, Google CAS, AWS ACM Private CA, Entrust Certificate Services, GlobalSign Atlas HVCA, EJBCA (Keyfactor)
-2. **Target Connector** — Deploys certificates to infrastructure. 15 built-in: NGINX, Apache httpd, HAProxy, Traefik, Caddy, Envoy, Postfix/Dovecot (dual-mode), IIS (local PowerShell + WinRM proxy), F5 BIG-IP (proxy agent), SSH (agentless), Windows Certificate Store, Java Keystore (JKS / PKCS#12), Kubernetes Secrets, AWS Certificate Manager, Azure Key Vault
+2. **Target Connector** — Deploys certificates to infrastructure. 14 production-ready: NGINX, Apache httpd, HAProxy, Traefik, Caddy, Envoy, Postfix/Dovecot (dual-mode), IIS (local PowerShell + WinRM proxy), F5 BIG-IP (proxy agent), SSH (agentless), Windows Certificate Store, Java Keystore (JKS / PKCS#12), AWS Certificate Manager, Azure Key Vault. Plus Kubernetes Secrets shipped as preview — see the *Preview connectors* subsection above for the ACK gate.
 3. **Notifier Connector** — Sends alerts about certificate events (Email, Webhooks, Slack, Microsoft Teams, PagerDuty, OpsGenie implemented)
 All connectors accept JSON configuration at initialization, support config validation, and are registered in the service layer. Issuer connectors run on the control plane; target connectors run on agents. For network appliances where agents can't be installed, a **proxy agent** in the same network zone handles deployment — the server never initiates outbound connections.
@@ -0,0 +1,111 @@
 # MCP ↔ REST API parity coverage
 > Last reviewed: 2026-05-16
 ## What this file is
 This is the canonical record of which certctl REST routes are exposed
 as MCP (Model Context Protocol) tools, plus the explicit allowlist of
 routes that are intentionally NOT exposed. The companion CI guard
 `scripts/ci-guards/mcp-coverage-parity.sh` fails the build if a new
 REST route lands without either an MCP tool wrapping it or an
 explicit allowlist entry justifying the exclusion.
 Before ARCH-004 (Sprint 4, 2026-05-16) the README said *"the full REST
 API is exposed as MCP tools"* with no published coverage data. That
 wording was an overclaim — see the audit trail in `git log --grep='ARCH-004'`.
 ## Current numbers
 Re-derive at any time:
 ```bash
 # REST routes registered by the router
 grep -cE '^\s*r\.Register\(' internal/api/router/router.go
 # MCP tools registered (counts gomcp.AddTool call sites)
 grep -rcE 'gomcp\.AddTool' internal/mcp/ --include='*.go' \
  | grep -v '_test.go' | awk -F: '{s+=$2} END{print s}'
 ```
 At the most recent verification (2026-05-16): **221 routes / 162 tools**.
 ## Coverage categories
 The gap between routes and tools is intentional and falls into four
 named exclusion categories. Adding a new REST route in any of these
 categories does NOT require a paired MCP tool — but it DOES require
 an allowlist entry in the CI guard.
 ### 1. Protocol-conformance endpoints
 Routes that implement a wire protocol an automated client (cert-manager,
 certbot, lego, MS Intune, EST devices, OCSP responders, CRL fetchers)
 talks to directly. These are not human-driven API calls; the MCP
 "natural language → tool call" model doesn't fit them. The MCP server
 SHOULD NOT wrap these because exposing them would invite operators to
 ask an AI agent to "renew the cert via ACME" when the right answer is
 "the ACME client your existing infra already runs handles that."
 - `/acme/*` — RFC 8555 + RFC 9773 (ACME server)
 - `/scep/*` — RFC 8894 (SCEP server, MS Intune)
 - `/.well-known/est/*` — RFC 7030 (EST server)
 - `/ocsp` — RFC 6960 (OCSP responder)
 - `/.well-known/pki/crl/*` — RFC 5280 CRL distribution
 ### 2. Browser-only auth flow endpoints
 OIDC SSO + CSRF + bootstrap routes that exist solely for the GUI's
 session establishment dance. An MCP client should authenticate via
 the same API-key Bearer path the REST callers use; exposing the
 browser flow as a tool would be incoherent.
 - `/auth/oidc/login`
 - `/auth/oidc/callback`
 - `/auth/oidc/back-channel-logout`
 - `POST /api/v1/auth/bootstrap` (one-shot day-0 admin)
 - `POST /api/v1/auth/login`, `POST /api/v1/auth/logout`
 - `GET /api/v1/auth/csrf`
 ### 3. Liveness / readiness / version
 Out of scope for natural-language workflows.
 - `/health`
 - `/ready`
 - `/api/v1/version`
 ### 4. Streaming / binary download endpoints
 The MCP tool contract is request → response JSON. Binary streaming
 and chunked transfer don't fit the shape and would force lossy
 encoding (base64-wrapped JSON blobs) the operator wouldn't actually
 use through an AI assistant.
 - `GET /api/v1/certificates/{id}/download` — raw PEM
 - `GET /api/v1/certificates/{id}/chain` — chain PEM
 - `GET /api/v1/intermediate-cas/{id}/cert` — raw cert
 - `GET /api/v1/metrics/prometheus` — Prometheus text format
 ## How to add a new route
 1. Add the route in `internal/api/router/router.go`.
 2. Decide: should an AI assistant be able to invoke this?
   - **Yes** → add a matching `gomcp.AddTool` call in `internal/mcp/`.
   - **No** → confirm the route fits one of the four exclusion
     categories above AND add an entry to the allowlist in
     `scripts/ci-guards/mcp-coverage-parity.sh`.
 3. The CI guard will fail until either branch is satisfied.
 If the route doesn't fit any of the four categories and you don't
 want it in MCP for another reason, name a fifth category in this
 file and update the CI guard. The list is meant to grow with the
 product, not contain it.
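Before touching the guard, it can help to check which of the four categories a candidate route would fall under. The sketch below encodes the route lists from this file as shell patterns. It is an illustration only: `exclusion_category` is a hypothetical name, the real allowlist format in `mcp-coverage-parity.sh` is not shown here, and HTTP methods are ignored.

```sh
# exclusion_category PATH
# Prints the exclusion category for a route path, or fails if none applies.
exclusion_category() {
  case "$1" in
    /acme/*|/scep/*|/.well-known/est/*|/ocsp|/.well-known/pki/crl/*)
      echo protocol-conformance ;;
    /auth/oidc/*|/api/v1/auth/bootstrap|/api/v1/auth/login|/api/v1/auth/logout|/api/v1/auth/csrf)
      echo browser-auth ;;
    /health|/ready|/api/v1/version)
      echo liveness ;;
    /api/v1/certificates/*/download|/api/v1/certificates/*/chain|/api/v1/intermediate-cas/*/cert|/api/v1/metrics/prometheus)
      echo streaming ;;
    *) return 1 ;;
  esac
}

for route in /acme/new-nonce /api/v1/certificates; do
  if category=$(exclusion_category "$route"); then
    echo "$route: excluded ($category), needs an allowlist entry"
  else
    echo "$route: no exclusion category, needs an MCP tool"
  fi
done
```

A route that falls through every pattern is the case the final paragraph describes: either wrap it as a tool or name a new category.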
 ## Why this matters
 certctl is sold to operators who'll use AI assistants to drive it.
 "Most of the REST API" is a meaningful coverage claim; "the full REST
 API" was not. Diligence reviewers and operators evaluating MCP-driven
 workflows need the explicit gap surface — both to plan their
 automation around the gap and to spot when the gap drifts.
@@ -4,12 +4,12 @@
 <!-- Re-run after adding or removing any t.Skip(). CI guard:    -->
 <!-- scripts/ci-guards/skip-inventory-drift.sh                  -->
-> Last reviewed: 2026-05-13
+> Last reviewed: 2026-05-16
 ## Summary
-- Total t.Skip sites: **142**
+- Total t.Skip sites: **147**
-- testing.Short() guards: **76** (these gate behind `go test -short`)
+- testing.Short() guards: **82** (these gate behind `go test -short`)
 Re-run inventory with: `./scripts/skip-inventory.sh`.
@@ -103,7 +103,7 @@ Re-run inventory with: `./scripts/skip-inventory.sh`.
 ### `internal/auth/oidc/domain`
-- `internal/auth/oidc/domain/types_test.go:186` — t.Skip()
+- `internal/auth/oidc/domain/types_test.go:221` — t.Skip()
 ### `internal/auth/oidc`
@@ -114,7 +114,7 @@ Re-run inventory with: `./scripts/skip-inventory.sh`.
 ### `internal/ciparity`
-- `internal/ciparity/surface_parity_test.go:97` — // readFileOrSkip reads a file; on ENOENT, calls t.Skipf rather than
+- `internal/ciparity/surface_parity_test.go:113` — // readFileOrSkip reads a file; on ENOENT, calls t.Skipf rather than
 ### `internal/connector/issuer/acme`
@@ -156,10 +156,15 @@ Re-run inventory with: `./scripts/skip-inventory.sh`.
 ### `internal/ratelimit`
 - `internal/ratelimit/equivalence_test.go:80` — t.Skip("race-style test under -short")
 - `internal/ratelimit/equivalence_test.go:88` — t.Skip("postgres equivalence tests require testcontainers; skipped under -short")
 - `internal/ratelimit/sliding_window_test.go:146` — t.Skip("race-style test under -short")
 ### `internal/repository/postgres`
 - `internal/repository/postgres/audit_chain_test.go:137` — t.Skip("skipping integration test in short mode")
 - `internal/repository/postgres/audit_chain_test.go:36` — t.Skip("skipping integration test in short mode")
 - `internal/repository/postgres/audit_chain_test.go:58` — t.Skip("skipping integration test in short mode")
 - `internal/repository/postgres/audit_worm_test.go:29` — t.Skip("skipping integration test in short mode")
 - `internal/repository/postgres/auth_revoke_scope_test.go:118` — t.Skip("integration test in short mode")
 - `internal/repository/postgres/auth_revoke_scope_test.go:149` — t.Skip("integration test in short mode")
@@ -23,12 +23,25 @@ require (
 	github.com/leanovate/gopter v0.2.11
 	github.com/masterzen/winrm v0.0.0-20250927112105-5f8e6c707321
 	github.com/pkg/sftp v1.13.10
 	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.43.0
 	go.opentelemetry.io/otel/sdk v1.43.0
 	golang.org/x/crypto v0.50.0
 	golang.org/x/oauth2 v0.36.0
 	golang.org/x/sync v0.20.0
 	software.sslmate.com/src/go-pkcs12 v0.7.0
 )
 require (
 	github.com/cenkalti/backoff/v5 v5.0.3 // indirect
 	github.com/grpc-ecosystem/grpc-gateway/v2 v2.28.0 // indirect
 	go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.43.0 // indirect
 	go.opentelemetry.io/proto/otlp v1.10.0 // indirect
 	google.golang.org/genproto/googleapis/api v0.0.0-20260504160031-60b97b32f348 // indirect
 	google.golang.org/genproto/googleapis/rpc v0.0.0-20260504160031-60b97b32f348 // indirect
 	google.golang.org/grpc v1.80.0 // indirect
 	google.golang.org/protobuf v1.36.11 // indirect
 )
 require (
 	dario.cat/mergo v1.0.2 // indirect
 	github.com/Azure/azure-sdk-for-go/sdk/internal v1.11.2 // indirect
@@ -110,9 +123,9 @@ require (
 	github.com/yusufpapurcu/wmi v1.2.4 // indirect
 	go.opentelemetry.io/auto/sdk v1.2.1 // indirect
 	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.60.0 // indirect
-	go.opentelemetry.io/otel v1.41.0 // indirect
+	go.opentelemetry.io/otel v1.43.0
-	go.opentelemetry.io/otel/metric v1.41.0 // indirect
+	go.opentelemetry.io/otel/metric v1.43.0 // indirect
-	go.opentelemetry.io/otel/trace v1.41.0 // indirect
+	go.opentelemetry.io/otel/trace v1.43.0 // indirect
 	golang.org/x/net v0.53.0 // indirect
 	golang.org/x/sys v0.43.0 // indirect
 	golang.org/x/text v0.36.0 // indirect
@@ -111,6 +111,8 @@ github.com/bodgit/windows v1.0.1 h1:tF7K6KOluPYygXa3Z2594zxlkbKPAOvqr97etrGNIz4=
 github.com/bodgit/windows v1.0.1/go.mod h1:a6JLwrB4KrTR5hBpp8FI9/9W9jJfeQ2h4XDXU74ZCdM=
 github.com/cenkalti/backoff/v4 v4.3.0 h1:MyRJ/UdXutAwSAT+s3wNd7MfTIcy71VQueUuFK343L8=
 github.com/cenkalti/backoff/v4 v4.3.0/go.mod h1:Y3VNntkOUPxTVeUxJ/G5vcM//AlwfmyYozVcomhLiZE=
 github.com/cenkalti/backoff/v5 v5.0.3 h1:ZN+IMa753KfX5hd8vVaMixjnqRZ3y8CuJKRKj1xcsSM=
 github.com/cenkalti/backoff/v5 v5.0.3/go.mod h1:rkhZdG3JZukswDf7f0cwqPNk4K0sa+F97BxZthm/crw=
 github.com/census-instrumentation/opencensus-proto v0.2.1/go.mod h1:f6KPmirojxKA12rnyqOA5BBL4O983OfeGPqjHWSTneU=
 github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs=
 github.com/cespare/xxhash/v2 v2.3.0/go.mod h1:VGX0DQ3Q6kWi7AoAeZDth3/j3BFtOZR5XLFGgcrjCOs=
@@ -208,6 +210,8 @@ github.com/golang/protobuf v1.4.3/go.mod h1:oDoupMAO8OvCJWAcko0GGGIgR6R6ocIYbsSw
 github.com/golang/protobuf v1.5.0/go.mod h1:FsONVRAS9T7sI+LIUmWTfcYkHO4aIWwzhcaSAoJOfIk=
 github.com/golang/protobuf v1.5.1/go.mod h1:DopwsBzvsk0Fs44TXzsVbJyPhcCPeIwnvohx4u74HPM=
 github.com/golang/protobuf v1.5.2/go.mod h1:XVQd3VNwM+JqD3oG2Ue2ip4fOMUkwXdXDdiuN0vRsmY=
 github.com/golang/protobuf v1.5.4 h1:i7eJL8qZTpSEXOPTxNKhASYpMn+8e5Q6AdndVa1dWek=
 github.com/golang/protobuf v1.5.4/go.mod h1:lnTiLA8Wa4RWRcIUkrtSVa5nRhsEGBg48fD6rSs7xps=
 github.com/google/btree v0.0.0-20180813153112-4030bb1f1f0c/go.mod h1:lNA+9X1NB3Zf8V7Ke586lFgjr2dZNuvo3lPJSGZ5JPQ=
 github.com/google/btree v1.0.0/go.mod h1:lNA+9X1NB3Zf8V7Ke586lFgjr2dZNuvo3lPJSGZ5JPQ=
 github.com/google/go-cmp v0.2.0/go.mod h1:oXzfMopK8JAjlY9xF4vHSVASa0yLyX7SntLO5aqRK0M=
@@ -254,6 +258,8 @@ github.com/gorilla/securecookie v1.1.1/go.mod h1:ra0sb63/xPlUeL+yeDciTfxMRAA+MP+
 github.com/gorilla/sessions v1.2.1 h1:DHd3rPN5lE3Ts3D8rKkQ8x/0kqfeNmBAaiSi+o7FsgI=
 github.com/gorilla/sessions v1.2.1/go.mod h1:dk2InVEVJ0sfLlnXv9EAgkf6ecYs/i80K/zI+bUmuGM=
 github.com/grpc-ecosystem/grpc-gateway v1.16.0/go.mod h1:BDjrQk3hbvj6Nolgz8mAMFbcEtjT1g+wF4CSlocrBnw=
 github.com/grpc-ecosystem/grpc-gateway/v2 v2.28.0 h1:HWRh5R2+9EifMyIHV7ZV+MIZqgz+PMpZ14Jynv3O2Zs=
 github.com/grpc-ecosystem/grpc-gateway/v2 v2.28.0/go.mod h1:JfhWUomR1baixubs02l85lZYYOm7LV6om4ceouMv45c=
 github.com/hashicorp/consul/api v1.1.0/go.mod h1:VmuI/Lkw1nC05EYQWNKwWGbkg+FbDBtguAZLlVdkD9Q=
 github.com/hashicorp/consul/sdk v0.1.1/go.mod h1:VKf9jXwCTEY1QZP2MOLRhb5i/I/ssyNV1vwHyQBF0x8=
 github.com/hashicorp/errwrap v1.0.0/go.mod h1:YH+1FKiLXxHSkmPseP+kNlulaMuP3n2brvKWEqk/Jc4=
@@ -461,17 +467,25 @@ go.opentelemetry.io/auto/sdk v1.2.1 h1:jXsnJ4Lmnqd11kwkBV2LgLoFMZKizbCi5fNZ/ipaZ
 go.opentelemetry.io/auto/sdk v1.2.1/go.mod h1:KRTj+aOaElaLi+wW1kO/DZRXwkF4C5xPbEe3ZiIhN7Y=
 go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.60.0 h1:sbiXRNDSWJOTobXh5HyQKjq6wUC5tNybqjIqDpAY4CU=
 go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.60.0/go.mod h1:69uWxva0WgAA/4bu2Yy70SLDBwZXuQ6PbBpbsa5iZrQ=
-go.opentelemetry.io/otel v1.41.0 h1:YlEwVsGAlCvczDILpUXpIpPSL/VPugt7zHThEMLce1c=
+go.opentelemetry.io/otel v1.43.0 h1:mYIM03dnh5zfN7HautFE4ieIig9amkNANT+xcVxAj9I=
-go.opentelemetry.io/otel v1.41.0/go.mod h1:Yt4UwgEKeT05QbLwbyHXEwhnjxNO6D8L5PQP51/46dE=
+go.opentelemetry.io/otel v1.43.0/go.mod h1:JuG+u74mvjvcm8vj8pI5XiHy1zDeoCS2LB1spIq7Ay0=
-go.opentelemetry.io/otel/metric v1.41.0 h1:rFnDcs4gRzBcsO9tS8LCpgR0dxg4aaxWlJxCno7JlTQ=
+go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.43.0 h1:88Y4s2C8oTui1LGM6bTWkw0ICGcOLCAI5l6zsD1j20k=
-go.opentelemetry.io/otel/metric v1.41.0/go.mod h1:xPvCwd9pU0VN8tPZYzDZV/BMj9CM9vs00GuBjeKhJps=
+go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.43.0/go.mod h1:Vl1/iaggsuRlrHf/hfPJPvVag77kKyvrLeD10kpMl+A=
-go.opentelemetry.io/otel/sdk v1.35.0 h1:iPctf8iprVySXSKJffSS79eOjl9pvxV9ZqOWT0QejKY=
+go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.43.0 h1:3iZJKlCZufyRzPzlQhUIWVmfltrXuGyfjREgGP3UUjc=
-go.opentelemetry.io/otel/sdk v1.35.0/go.mod h1:+ga1bZliga3DxJ3CQGg3updiaAJoNECOgJREo9KHGQg=
+go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.43.0/go.mod h1:/G+nUPfhq2e+qiXMGxMwumDrP5jtzU+mWN7/sjT2rak=
-go.opentelemetry.io/otel/sdk/metric v1.35.0 h1:1RriWBmCKgkeHEhM7a2uMjMUfP7MsOF5JpUCaEqEI9o=
+go.opentelemetry.io/otel/metric v1.43.0 h1:d7638QeInOnuwOONPp4JAOGfbCEpYb+K6DVWvdxGzgM=
-go.opentelemetry.io/otel/sdk/metric v1.35.0/go.mod h1:is6XYCUMpcKi+ZsOvfluY5YstFnhW0BidkR+gL+qN+w=
+go.opentelemetry.io/otel/metric v1.43.0/go.mod h1:RDnPtIxvqlgO8GRW18W6Z/4P462ldprJtfxHxyKd2PY=
-go.opentelemetry.io/otel/trace v1.41.0 h1:Vbk2co6bhj8L59ZJ6/xFTskY+tGAbOnCtQGVVa9TIN0=
+go.opentelemetry.io/otel/sdk v1.43.0 h1:pi5mE86i5rTeLXqoF/hhiBtUNcrAGHLKQdhg4h4V9Dg=
-go.opentelemetry.io/otel/trace v1.41.0/go.mod h1:U1NU4ULCoxeDKc09yCWdWe+3QoyweJcISEVa1RBzOis=
+go.opentelemetry.io/otel/sdk v1.43.0/go.mod h1:P+IkVU3iWukmiit/Yf9AWvpyRDlUeBaRg6Y+C58QHzg=
 go.opentelemetry.io/otel/sdk/metric v1.43.0 h1:S88dyqXjJkuBNLeMcVPRFXpRw2fuwdvfCGLEo89fDkw=
 go.opentelemetry.io/otel/sdk/metric v1.43.0/go.mod h1:C/RJtwSEJ5hzTiUz5pXF1kILHStzb9zFlIEe85bhj6A=
 go.opentelemetry.io/otel/trace v1.43.0 h1:BkNrHpup+4k4w+ZZ86CZoHHEkohws8AY+WTX09nk+3A=
 go.opentelemetry.io/otel/trace v1.43.0/go.mod h1:/QJhyVBUUswCphDVxq+8mld+AvhXZLhe+8WVFxiFff0=
 go.opentelemetry.io/proto/otlp v1.10.0 h1:IQRWgT5srOCYfiWnpqUYz9CVmbO8bFmKcwYxpuCSL2g=
 go.opentelemetry.io/proto/otlp v1.10.0/go.mod h1:/CV4QoCR/S9yaPj8utp3lvQPoqMtxXdzn7ozvvozVqk=
 go.uber.org/atomic v1.7.0/go.mod h1:fEN4uk6kAWBTFdckzkM89CLk9XfWZrxpCo0nPH17wJc=
 go.uber.org/goleak v1.3.0 h1:2K3zAYmnTNqV73imy9J1T3WC+gmCePx2hEGkimedGto=
 go.uber.org/goleak v1.3.0/go.mod h1:CoHD4mav9JJNrW/WLlf7HGZPjdw8EucARQHekz1X6bE=
 go.uber.org/multierr v1.6.0/go.mod h1:cdWPpRnG4AhwMwsgIHip0KRBQjJy5kYEpYjJxpXp9iU=
 go.uber.org/zap v1.17.0/go.mod h1:MXVU+bhUf/A7Xi2HNOnopQOrmycQ5Ih87HtOu4q5SSo=
 golang.org/x/crypto v0.0.0-20181029021203-45a5f77698d3/go.mod h1:6SG95UA2DQfeDnfUPMdvaQW0Q7yPrPDi9nlGo2tz2b4=
@@ -731,6 +745,8 @@ golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7/go.mod h1:I/5z698sn9Ka8T
 golang.org/x/xerrors v0.0.0-20191011141410-1b5146add898/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
 golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
 golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
 gonum.org/v1/gonum v0.17.0 h1:VbpOemQlsSMrYmn7T2OUvQ4dqxQXU+ouZFQsZOx50z4=
 gonum.org/v1/gonum v0.17.0/go.mod h1:El3tOrEuMpv2UdMrbNlKEh9vd86bmQ6vqIcDwxEOc1E=
 google.golang.org/api v0.4.0/go.mod h1:8k5glujaEP+g9n7WNsDg8QP6cUVNI86fCNMcbazEtwE=
 google.golang.org/api v0.7.0/go.mod h1:WtwebWUNSVBH/HAw79HIFXZNqEvBhG+Ra+ax0hx3E3M=
 google.golang.org/api v0.8.0/go.mod h1:o4eAsZoiT+ibD93RtjEohWalFOjRDx6CVaqeizhEnKg=
@@ -801,6 +817,10 @@ google.golang.org/genproto v0.0.0-20210310155132-4ce2db91004e/go.mod h1:FWY/as6D
 google.golang.org/genproto v0.0.0-20210319143718-93e7006c17a6/go.mod h1:FWY/as6DDZQgahTzZj3fqbO1CbirC29ZNUFHwi0/+no=
 google.golang.org/genproto v0.0.0-20210402141018-6c239bbf2bb1/go.mod h1:9lPAdzaEmUacj36I+k7YKbEc5CXzPIeORRgDAUOu28A=
 google.golang.org/genproto v0.0.0-20210602131652-f16073e35f0c/go.mod h1:UODoCrxHCcBojKKwX1terBiRUaqAsFqJiF615XL43r0=
 google.golang.org/genproto/googleapis/api v0.0.0-20260504160031-60b97b32f348 h1:U8orV30l6KpDsi9dxU0CoJZGbjS8EEpw+6ba+XwGPQA=
 google.golang.org/genproto/googleapis/api v0.0.0-20260504160031-60b97b32f348/go.mod h1:Yzdzr5OOZFgSsEV2D/Xi9NL3bszpXFAg0hFJiRohcD8=
 google.golang.org/genproto/googleapis/rpc v0.0.0-20260504160031-60b97b32f348 h1:pfIbyB44sWzHiCpRqIen67ZQnVXSfIxWrqUMk1qwODE=
 google.golang.org/genproto/googleapis/rpc v0.0.0-20260504160031-60b97b32f348/go.mod h1:4Hqkh8ycfw05ld/3BWL7rJOSfebL2Q+DVDeRgYgxUU8=
 google.golang.org/grpc v1.19.0/go.mod h1:mqu4LbDTu4XGKhr4mRzUsmM4RtVoemTSY81AxZiDr8c=
 google.golang.org/grpc v1.20.1/go.mod h1:10oTOabMzJvdu6/UiuZezV6QK5dSlG84ov/aaiqXj38=
 google.golang.org/grpc v1.21.1/go.mod h1:oYelfM1adQP15Ek0mdvEgi9Df8B9CZIaU1084ijfRaM=
@@ -821,6 +841,8 @@ google.golang.org/grpc v1.35.0/go.mod h1:qjiiYl8FncCW8feJPdyg3v6XW24KsRHe+dy9BAG
 google.golang.org/grpc v1.36.0/go.mod h1:qjiiYl8FncCW8feJPdyg3v6XW24KsRHe+dy9BAGRRjU=
 google.golang.org/grpc v1.36.1/go.mod h1:qjiiYl8FncCW8feJPdyg3v6XW24KsRHe+dy9BAGRRjU=
 google.golang.org/grpc v1.38.0/go.mod h1:NREThFqKR1f3iQ6oBuvc5LadQuXVGo9rkm5ZGrQdJfM=
 google.golang.org/grpc v1.80.0 h1:Xr6m2WmWZLETvUNvIUmeD5OAagMw3FiKmMlTdViWsHM=
 google.golang.org/grpc v1.80.0/go.mod h1:ho/dLnxwi3EDJA4Zghp7k2Ec1+c2jqup0bFkw07bwF4=
 google.golang.org/protobuf v0.0.0-20200109180630-ec00e32a8dfd/go.mod h1:DFci5gLYBciE7Vtevhsrf46CRTquxDuWsQurQQe4oz8=
 google.golang.org/protobuf v0.0.0-20200221191635-4d8936d0db64/go.mod h1:kwYJMbMJ01Woi6D6+Kah6886xMZcty6N08ah7+eCXa0=
 google.golang.org/protobuf v0.0.0-20200228230310-ab0ca4ff8a60/go.mod h1:cfTl7dwQJ+fmap5saPgwCLgHXTUD7jkjRqWcaiX5VyM=
@@ -833,6 +855,8 @@ google.golang.org/protobuf v1.24.0/go.mod h1:r/3tXBNzIEhYS9I1OUVjXDlt8tc493IdKGj
 google.golang.org/protobuf v1.25.0/go.mod h1:9JNX74DMeImyA3h4bdi1ymwjUzf21/xIlbajtzgsN7c=
 google.golang.org/protobuf v1.26.0-rc.1/go.mod h1:jlhhOSvTdKEhbULTjvd4ARK9grFBp09yW+WbY/TyQbw=
 google.golang.org/protobuf v1.26.0/go.mod h1:9q0QmTI4eRPtz6boOQmLYwt+qCgq0jsYwAQnmE0givc=
 google.golang.org/protobuf v1.36.11 h1:fV6ZwhNocDyBLK0dj+fg8ektcVegBBuEolpbTQyBNVE=
 google.golang.org/protobuf v1.36.11/go.mod h1:HTf+CrKn2C3g5S8VImy6tdcUvCska2kB7j23XfzDpco=
 gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
 gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
 gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk=
@@ -201,7 +201,35 @@ check_privileges() {
    fi
 }
-# Download agent binary from GitHub Releases
+# Download + verify agent binary from GitHub Releases.
 #
 # Acquisition-audit RED-007 closure (Sprint 7 ACQ, 2026-05-16). Before
 # 2026-05-16 the script downloaded the binary with no integrity
 # check, so a tampered or misnamed asset on the release surface
 # would install silently. HTTPS already prevents in-flight
 # tampering, but it cannot detect a compromised release-asset
 # upload. The download path now performs two independent
 # verifications:
 #
 #   1. SHA-256 against the published checksums.txt sidecar
 #      (.github/workflows/release.yml aggregate-checksums job).
 #      sha256sum is in coreutils on Linux; macOS ships `shasum`,
 #      which we fall back to.
 #   2. Cosign keyless verify against the project's GitHub OIDC
 #      identity (sigstore/cosign-installer pinned in release.yml).
 #      The signature bundle is the `<binary>.sigstore.json` sibling
 #      asset every release publishes. Cosign verify is OPTIONAL
 #      when the operator doesn't have cosign installed — the
 #      script logs a clear WARN and proceeds; operators in
 #      regulated environments MUST install cosign first
 #      (curl -sSL https://github.com/sigstore/cosign/releases/...)
 #      and re-run.
 #
 # Both verifications happen against the temp file BEFORE
 # install_binary copies it to $INSTALL_DIR. A failed checksum
 # rejects the install. A failed cosign verify also rejects the
 # install. Either rejection rm -f's the temp file and exits 1.
 #
 # IMPORTANT: main() captures this function's stdout via `binary_path=$(download_binary)`,
 # so every status/error message MUST go to stderr (>&2). Only the final
 # `echo "$temp_file"` is allowed on stdout — that's the return value.
@@ -222,16 +250,109 @@ download_binary() {
        exit 1
    fi
-    local temp_file
+    local temp_file temp_sigstore temp_checksums
    temp_file=$(mktemp)
    temp_sigstore=$(mktemp --suffix=.sigstore.json 2>/dev/null || mktemp -t sigstore)
    temp_checksums=$(mktemp)
    if ! curl -sSL -f "$download_url" -o "$temp_file" >&2; then
-        rm -f "$temp_file"
+        rm -f "$temp_file" "$temp_sigstore" "$temp_checksums"
        echo -e "${RED}Error: Failed to download binary from $download_url${NC}" >&2
        echo "Make sure the latest release exists on GitHub with the binary asset for ${OS_TYPE}-${ARCH_TYPE}." >&2
        exit 1
    fi
    # ---- SHA-256 verify against the release-published checksums.txt ----
    #
    # Every release publishes a single checksums.txt (sha256sum format) +
    # a cosign signature on it (checksums.txt.sigstore.json). Downloading
    # via the same RELEASE_URL keeps the integrity chain rooted at the
    # GitHub-release surface (not a sibling CDN), so a release-asset
    # tamper is caught by the very first hash comparison.
    echo -e "${YELLOW}Downloading checksums.txt for SHA-256 verification...${NC}" >&2
    if ! curl -sSL -f "${RELEASE_URL}/checksums.txt" -o "$temp_checksums" >&2; then
        rm -f "$temp_file" "$temp_sigstore" "$temp_checksums"
        echo -e "${RED}Error: Failed to download checksums.txt from ${RELEASE_URL}.${NC}" >&2
        echo "The agent binary cannot be installed without integrity verification." >&2
        exit 1
    fi
    # Look up the binary's expected hash in the checksums file.
    local expected_hash
    expected_hash=$(awk -v name="$binary_name" '$2 == name {print $1; exit}' "$temp_checksums")
    if [[ -z "$expected_hash" ]]; then
        rm -f "$temp_file" "$temp_sigstore" "$temp_checksums"
        echo -e "${RED}Error: checksums.txt has no entry for $binary_name.${NC}" >&2
        echo "The release surface is inconsistent — refusing to install." >&2
        exit 1
    fi
    local actual_hash sha_tool
    if command -v sha256sum &> /dev/null; then
        sha_tool="sha256sum"
        actual_hash=$(sha256sum "$temp_file" | awk '{print $1}')
    elif command -v shasum &> /dev/null; then
        sha_tool="shasum -a 256"
        actual_hash=$(shasum -a 256 "$temp_file" | awk '{print $1}')
    else
        rm -f "$temp_file" "$temp_sigstore" "$temp_checksums"
        echo -e "${RED}Error: neither sha256sum nor shasum is installed.${NC}" >&2
        echo "Install coreutils (Linux) or shasum (macOS) and re-run." >&2
        exit 1
    fi
    if [[ "$actual_hash" != "$expected_hash" ]]; then
        rm -f "$temp_file" "$temp_sigstore" "$temp_checksums"
        echo -e "${RED}Error: SHA-256 mismatch for $binary_name (tool: $sha_tool).${NC}" >&2
        echo "  expected: $expected_hash" >&2
        echo "  actual:   $actual_hash" >&2
        echo "The downloaded binary does NOT match the release-published checksum." >&2
        echo "Refusing to install. Re-run after investigating the release surface." >&2
        exit 1
    fi
    echo -e "${GREEN}SHA-256 verified ($sha_tool):${NC} $actual_hash" >&2
    # ---- Cosign keyless verify (OPTIONAL — warn-mode if absent) ----
    #
    # The release publishes <binary>.sigstore.json next to each binary,
    # signed via sigstore/cosign-installer keyless mode against the
    # GitHub Actions OIDC identity for the certctl-io/certctl repo
    # (see .github/workflows/release.yml). Cosign verify with the
    # certificate-identity-regexp + certificate-oidc-issuer pair
    # pins the signature to the repo's release workflow — a malicious
    # asset signed under a different identity fails the verify.
    if command -v cosign &> /dev/null; then
        echo -e "${YELLOW}Cosign keyless-verifying binary signature...${NC}" >&2
        if ! curl -sSL -f "${download_url}.sigstore.json" -o "$temp_sigstore" >&2; then
            rm -f "$temp_file" "$temp_sigstore" "$temp_checksums"
            echo -e "${RED}Error: Failed to download cosign signature from ${download_url}.sigstore.json.${NC}" >&2
            echo "Either the release surface is broken or this binary predates the cosign-signed releases. Refusing to install." >&2
            exit 1
        fi
        if ! COSIGN_EXPERIMENTAL=1 cosign verify-blob \
                --bundle "$temp_sigstore" \
                --certificate-identity-regexp "^https://github.com/${GITHUB_REPO}/" \
                --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
                "$temp_file" >&2; then
            rm -f "$temp_file" "$temp_sigstore" "$temp_checksums"
            echo -e "${RED}Error: cosign verify-blob failed for $binary_name.${NC}" >&2
            echo "The binary is NOT signed by the expected GitHub Actions OIDC identity." >&2
            echo "Refusing to install. This is the load-bearing supply-chain check." >&2
            exit 1
        fi
        echo -e "${GREEN}Cosign signature verified${NC} (identity matches ${GITHUB_REPO} release workflow)" >&2
    else
        echo -e "${YELLOW}WARNING:${NC} cosign is not installed — SKIPPING signature verification." >&2
        echo "  SHA-256 verification above is still in force, but the cosign signature" >&2
        echo "  ties the binary to the certctl-io/certctl release workflow's OIDC" >&2
        echo "  identity — the load-bearing supply-chain check. Operators in regulated" >&2
        echo "  environments MUST install cosign and re-run:" >&2
        echo "    curl -sSL https://github.com/sigstore/cosign/releases/latest/download/cosign-${OS_TYPE}-${ARCH_TYPE} -o /usr/local/bin/cosign" >&2
        echo "    chmod +x /usr/local/bin/cosign" >&2
        echo "  Continuing with SHA-256 verification only." >&2
    fi
    rm -f "$temp_sigstore" "$temp_checksums"
    chmod +x "$temp_file"
    echo "$temp_file"
 }
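Reviewer note: the SHA-256 step above does three things: it looks up the expected digest by asset name in a sha256sum-format file, hashes the downloaded bytes, and fails closed on a missing entry or a mismatch. The Go sketch below shows the same logic for readers more at home in Go than in awk. It is illustrative only; `expectedHash`, `verifySHA256`, and the asset name `agent-linux-amd64` are hypothetical and do not appear in the installer.

```go
package main

import (
	"bufio"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// expectedHash finds the hex digest recorded for name in sha256sum-style
// text ("<hex>  <filename>" per line), like the script's awk lookup.
func expectedHash(checksums, name string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(checksums))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[1] == name {
			return fields[0], true
		}
	}
	return "", false
}

// verifySHA256 fails closed: a missing entry and a digest mismatch both
// return an error, so the caller never installs an unverified binary.
func verifySHA256(data []byte, checksums, name string) error {
	want, ok := expectedHash(checksums, name)
	if !ok {
		return errors.New("checksums file has no entry for " + name)
	}
	sum := sha256.Sum256(data)
	if got := hex.EncodeToString(sum[:]); got != want {
		return fmt.Errorf("SHA-256 mismatch for %s: expected %s, actual %s", name, want, got)
	}
	return nil
}

func main() {
	payload := []byte("hello\n")
	sum := sha256.Sum256(payload)
	// Hypothetical one-line checksums file for a hypothetical asset name.
	checksums := hex.EncodeToString(sum[:]) + "  agent-linux-amd64\n"

	fmt.Println("intact payload error:", verifySHA256(payload, checksums, "agent-linux-amd64"))
	fmt.Println("tampered payload error:", verifySHA256([]byte("tampered"), checksums, "agent-linux-amd64"))
}
```

As in the script, the missing-entry case is treated as an error rather than a skip, so an inconsistent release surface cannot bypass verification.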
@@ -28,6 +28,18 @@ type AuditService interface {
 	// empty string returns all categories. Used by the auditor role
 	// (filtered to "auth" via /v1/audit?category=auth).
 	ListAuditEventsByCategory(ctx context.Context, eventCategory string, page, perPage int) ([]domain.AuditEvent, int64, error)
 	// ListAuditEventsByFilter (P-H2 closure, frontend-design-audit
 	// 2026-05-14) returns audit rows constrained by an optional time
 	// range AND optional category. Zero time.Time on either bound
 	// disables that bound. The repository already pushes the
 	// predicate into SQL (timestamp >=/<= since/until); this method
 	// just threads handler-parsed `since` / `until` query params
 	// through to the filter. Frontend (AuditPage) drops the pre-P-H2
 	// client-side time filter ("fetches the entire event window,
 	// throws 99% away in JS") and sends since/until directly. MCP's
 	// certctl_audit_list_with_category tool already advertised these
 	// params; this closure makes that advertised contract truthful.
 	ListAuditEventsByFilter(ctx context.Context, since, until time.Time, eventCategory string, page, perPage int) ([]domain.AuditEvent, int64, error)
 	// ExportEventsByFilter returns audit events matching a
 	// (from, to, eventCategory) filter, capped at maxRows. Audit
 	// 2026-05-10 HIGH-11 closure — backs the new
@@ -53,12 +65,29 @@ func NewAuditHandler(svc AuditService) AuditHandler {
 }
 // ListAuditEvents lists audit events.
-// GET /api/v1/audit?page=1&per_page=50&category=auth
+// GET /api/v1/audit?page=1&per_page=50&category=auth&since=<RFC3339>&until=<RFC3339>
 //
-// Bundle 1 Phase 8 adds the optional `category` query parameter for
+// Bundle 1 Phase 8 added the optional `category` query parameter for
 // auditor-role filtering. Allowed values: cert_lifecycle, auth, config.
 // Unknown values surface 400 so misuse is caught loud (instead of
 // silently returning all rows).
 //
 // P-H2 closure (frontend-design-audit 2026-05-14) adds the optional
 // `since` / `until` time-range query parameters. Both accept RFC3339
 // (e.g. "2026-04-01T00:00:00Z"). Either bound can be omitted to leave
 // that side open-ended. The repository already pushes the timestamp
 // predicate into the SQL query, and migration 000032's
 // (event_category, timestamp DESC) composite index makes the
 // predicate hit an index scan rather than a sequential scan.
 //
 // Note on naming: this endpoint uses `since` / `until` to match the
 // existing MCP `certctl_audit_list_with_category` tool's published
 // contract (internal/mcp/tools_audit_fix.go:174) and the audit-text
 // framing of the P-H2 finding. The sibling /api/v1/audit/export
 // endpoint uses `from` / `to` for compliance-window semantics
 // (required, ≤ 90-day range, NDJSON streaming); the two endpoints
 // share data but have different param semantics and the names were
 // chosen to reflect that.
 func (h AuditHandler) ListAuditEvents(w http.ResponseWriter, r *http.Request) {
 	if r.Method != http.MethodGet {
 		Error(w, http.StatusMethodNotAllowed, "Method not allowed")
@@ -93,16 +122,39 @@ func (h AuditHandler) ListAuditEvents(w http.ResponseWriter, r *http.Request) {
 		}
 	}
-	var (
-		events []domain.AuditEvent
-		total  int64
-		err    error
-	)
-	if category != "" {
-		events, total, err = h.svc.ListAuditEventsByCategory(r.Context(), category, page, perPage)
-	} else {
-		events, total, err = h.svc.ListAuditEvents(r.Context(), page, perPage)
+	// P-H2: optional time-range bounds. RFC3339 parse with explicit
+	// 400 on malformed input — silently dropping a malformed `since`
+	// would be worse than rejecting it (operator gets unfiltered
+	// results when they thought they were filtering).
+	var since, until time.Time
+	if s := query.Get("since"); s != "" {
+		parsed, err := time.Parse(time.RFC3339, s)
+		if err != nil {
+			ErrorWithRequestID(w, http.StatusBadRequest,
 				"`since` must be RFC3339 (e.g. 2026-04-01T00:00:00Z)",
 				requestID)
 			return
 		}
 		since = parsed
 	}
 	if u := query.Get("until"); u != "" {
 		parsed, err := time.Parse(time.RFC3339, u)
 		if err != nil {
 			ErrorWithRequestID(w, http.StatusBadRequest,
 				"`until` must be RFC3339 (e.g. 2026-05-01T00:00:00Z)",
 				requestID)
 			return
 		}
 		until = parsed
 	}
 	if !since.IsZero() && !until.IsZero() && !until.After(since) {
 		ErrorWithRequestID(w, http.StatusBadRequest,
 			"`until` must be after `since`",
 			requestID)
 		return
 	}
 	events, total, err := h.svc.ListAuditEventsByFilter(r.Context(), since, until, category, page, perPage)
 	if err != nil {
 		ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to list audit events", requestID)
 		return
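Reviewer note: the time-range validation in this hunk reduces to three rules: an absent parameter leaves the bound open (zero `time.Time`), a malformed value is rejected, and `until` must be strictly after `since` when both are present. The sketch below isolates those rules in a standalone function. `parseTimeRange` is a hypothetical helper, not something this diff adds; the handler inlines the same checks and reports failures through `ErrorWithRequestID`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// parseTimeRange is a hypothetical helper that takes the raw `since` and
// `until` query values. An empty string leaves that bound as the zero
// time.Time (disabled). Any returned error would map to HTTP 400.
func parseTimeRange(sinceRaw, untilRaw string) (since, until time.Time, err error) {
	if sinceRaw != "" {
		if since, err = time.Parse(time.RFC3339, sinceRaw); err != nil {
			return time.Time{}, time.Time{}, errors.New("`since` must be RFC3339")
		}
	}
	if untilRaw != "" {
		if until, err = time.Parse(time.RFC3339, untilRaw); err != nil {
			return time.Time{}, time.Time{}, errors.New("`until` must be RFC3339")
		}
	}
	// Equal bounds are rejected too: the range must be non-empty.
	if !since.IsZero() && !until.IsZero() && !until.After(since) {
		return time.Time{}, time.Time{}, errors.New("`until` must be after `since`")
	}
	return since, until, nil
}

func main() {
	since, until, err := parseTimeRange("2026-04-01T00:00:00Z", "")
	fmt.Println(since.Format(time.RFC3339), until.IsZero(), err)

	_, _, err = parseTimeRange("not-a-date", "")
	fmt.Println(err)

	_, _, err = parseTimeRange("2026-05-01T00:00:00Z", "2026-04-01T00:00:00Z")
	fmt.Println(err)
}
```

Returning an error for malformed input, rather than falling back to an open bound, keeps the behaviour the hunk's comment argues for: the caller never receives unfiltered results when it believed it was filtering.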
@@ -17,11 +17,16 @@ import (
 type mockAuditService struct {
 	listFunc       func(page, perPage int) ([]domain.AuditEvent, int64, error)
 	listByCatFunc  func(category string, page, perPage int) ([]domain.AuditEvent, int64, error)
 	listByFiltFunc func(since, until time.Time, category string, page, perPage int) ([]domain.AuditEvent, int64, error)
 	getFunc        func(id string) (*domain.AuditEvent, error)
 	// HIGH-11 self-audit trace — last RecordEventWithCategory call.
 	lastAuditActor    string
 	lastAuditAction   string
 	lastAuditCategory string
 	// P-H2 trace — last ListAuditEventsByFilter args.
 	lastFilterSince    time.Time
 	lastFilterUntil    time.Time
 	lastFilterCategory string
 }
 func (m *mockAuditService) ListAuditEvents(_ context.Context, page, perPage int) ([]domain.AuditEvent, int64, error) {
@@ -41,6 +46,27 @@ func (m *mockAuditService) ListAuditEventsByCategory(_ context.Context, category
 	return nil, 0, nil
 }
 // ListAuditEventsByFilter satisfies the P-H2 interface extension. The
 // test fixture remembers the (since, until, category) tuple so
 // per-subtest assertions can pin that the handler threaded the
 // query-string params through correctly. Falls back to listFunc /
 // listByCatFunc so existing tests don't need to set listByFiltFunc.
 func (m *mockAuditService) ListAuditEventsByFilter(_ context.Context, since, until time.Time, category string, page, perPage int) ([]domain.AuditEvent, int64, error) {
 	m.lastFilterSince = since
 	m.lastFilterUntil = until
 	m.lastFilterCategory = category
 	if m.listByFiltFunc != nil {
 		return m.listByFiltFunc(since, until, category, page, perPage)
 	}
 	if category != "" && m.listByCatFunc != nil {
 		return m.listByCatFunc(category, page, perPage)
 	}
 	if m.listFunc != nil {
 		return m.listFunc(page, perPage)
 	}
 	return nil, 0, nil
 }
 func (m *mockAuditService) GetAuditEvent(_ context.Context, id string) (*domain.AuditEvent, error) {
 	if m.getFunc != nil {
 		return m.getFunc(id)
@@ -325,6 +351,153 @@ func TestListAuditEvents_MethodNotAllowed(t *testing.T) {
 	}
 }
 // ── P-H2 closure (since / until time-range query params) ───────────
 // TestListAuditEvents_WithSinceUntil pins the happy path — both bounds
 // supplied in RFC3339, mock observes them threaded into the service
 // call, response is 200.
 func TestListAuditEvents_WithSinceUntil(t *testing.T) {
 	since := time.Date(2026, 4, 1, 0, 0, 0, 0, time.UTC)
 	until := time.Date(2026, 5, 1, 0, 0, 0, 0, time.UTC)
 	mockSvc := &mockAuditService{
 		listByFiltFunc: func(s, u time.Time, _ string, _, _ int) ([]domain.AuditEvent, int64, error) {
 			if !s.Equal(since) {
 				t.Errorf("service since = %v, want %v", s, since)
 			}
 			if !u.Equal(until) {
 				t.Errorf("service until = %v, want %v", u, until)
 			}
 			return []domain.AuditEvent{}, 0, nil
 		},
 	}
 	handler := NewAuditHandler(mockSvc)
 	url := "/api/v1/audit?since=" + since.Format(time.RFC3339) + "&until=" + until.Format(time.RFC3339)
 	req, err := http.NewRequest(http.MethodGet, url, nil)
 	if err != nil {
 		t.Fatalf("NewRequest failed: %v", err)
 	}
 	ctx := context.WithValue(req.Context(), middleware.RequestIDKey{}, "test-req-id")
 	req = req.WithContext(ctx)
 	w := httptest.NewRecorder()
 	handler.ListAuditEvents(w, req)
 	if w.Code != http.StatusOK {
 		t.Errorf("status = %d, want 200; body=%s", w.Code, w.Body.String())
 	}
 	if !mockSvc.lastFilterSince.Equal(since) {
 		t.Errorf("mock recorded since = %v, want %v", mockSvc.lastFilterSince, since)
 	}
 	if !mockSvc.lastFilterUntil.Equal(until) {
 		t.Errorf("mock recorded until = %v, want %v", mockSvc.lastFilterUntil, until)
 	}
 }
 // TestListAuditEvents_SinceOnly pins the one-sided bound: only `since`
 // is supplied and `until` stays zero. Covers the "operator filters to
 // events from the last hour" use case via since=<now-1h>.
 func TestListAuditEvents_SinceOnly(t *testing.T) {
 	since := time.Date(2026, 4, 1, 0, 0, 0, 0, time.UTC)
 	mockSvc := &mockAuditService{}
 	handler := NewAuditHandler(mockSvc)
 	req, _ := http.NewRequest(http.MethodGet, "/api/v1/audit?since="+since.Format(time.RFC3339), nil)
 	ctx := context.WithValue(req.Context(), middleware.RequestIDKey{}, "test-req-id")
 	req = req.WithContext(ctx)
 	w := httptest.NewRecorder()
 	handler.ListAuditEvents(w, req)
 	if w.Code != http.StatusOK {
 		t.Errorf("status = %d, want 200; body=%s", w.Code, w.Body.String())
 	}
 	if !mockSvc.lastFilterSince.Equal(since) {
 		t.Errorf("since = %v, want %v", mockSvc.lastFilterSince, since)
 	}
 	if !mockSvc.lastFilterUntil.IsZero() {
 		t.Errorf("until = %v, want zero (open-ended)", mockSvc.lastFilterUntil)
 	}
 }
 // TestListAuditEvents_InvalidSince pins the parse-error 400 path.
 // Silently dropping a malformed since would return ALL rows when the
 // operator thought they were filtering — worse than rejecting.
 func TestListAuditEvents_InvalidSince(t *testing.T) {
 	mockSvc := &mockAuditService{}
 	handler := NewAuditHandler(mockSvc)
 	req, _ := http.NewRequest(http.MethodGet, "/api/v1/audit?since=not-a-date", nil)
 	ctx := context.WithValue(req.Context(), middleware.RequestIDKey{}, "test-req-id")
 	req = req.WithContext(ctx)
 	w := httptest.NewRecorder()
 	handler.ListAuditEvents(w, req)
 	if w.Code != http.StatusBadRequest {
 		t.Errorf("status = %d, want 400; body=%s", w.Code, w.Body.String())
 	}
 	if !mockSvc.lastFilterSince.IsZero() {
 		t.Error("service should NOT have been called on bad since")
 	}
 }
 // TestListAuditEvents_UntilBeforeSince pins the order assertion: a
 // reversed range surfaces 400 rather than quietly returning empty.
 func TestListAuditEvents_UntilBeforeSince(t *testing.T) {
 	since := time.Date(2026, 5, 1, 0, 0, 0, 0, time.UTC)
 	until := time.Date(2026, 4, 1, 0, 0, 0, 0, time.UTC)
 	mockSvc := &mockAuditService{}
 	handler := NewAuditHandler(mockSvc)
 	url := "/api/v1/audit?since=" + since.Format(time.RFC3339) + "&until=" + until.Format(time.RFC3339)
 	req, _ := http.NewRequest(http.MethodGet, url, nil)
 	ctx := context.WithValue(req.Context(), middleware.RequestIDKey{}, "test-req-id")
 	req = req.WithContext(ctx)
 	w := httptest.NewRecorder()
 	handler.ListAuditEvents(w, req)
 	if w.Code != http.StatusBadRequest {
 		t.Errorf("status = %d, want 400; body=%s", w.Code, w.Body.String())
 	}
 }
 // TestListAuditEvents_TimeRangePlusCategory pins that since/until
 // compose with category (the auditor-role narrow-to-auth use case
 // extended to "auth events from yesterday" without a separate
 // endpoint).
 func TestListAuditEvents_TimeRangePlusCategory(t *testing.T) {
 	since := time.Date(2026, 4, 1, 0, 0, 0, 0, time.UTC)
 	until := time.Date(2026, 5, 1, 0, 0, 0, 0, time.UTC)
 	mockSvc := &mockAuditService{}
 	handler := NewAuditHandler(mockSvc)
 	url := "/api/v1/audit?category=auth&since=" + since.Format(time.RFC3339) + "&until=" + until.Format(time.RFC3339)
 	req, _ := http.NewRequest(http.MethodGet, url, nil)
 	ctx := context.WithValue(req.Context(), middleware.RequestIDKey{}, "test-req-id")
 	req = req.WithContext(ctx)
 	w := httptest.NewRecorder()
 	handler.ListAuditEvents(w, req)
 	if w.Code != http.StatusOK {
 		t.Errorf("status = %d, want 200; body=%s", w.Code, w.Body.String())
 	}
 	if mockSvc.lastFilterCategory != "auth" {
 		t.Errorf("category = %q, want auth", mockSvc.lastFilterCategory)
 	}
 	if !mockSvc.lastFilterSince.Equal(since) {
 		t.Errorf("since = %v, want %v", mockSvc.lastFilterSince, since)
 	}
 	if !mockSvc.lastFilterUntil.Equal(until) {
 		t.Errorf("until = %v, want %v", mockSvc.lastFilterUntil, until)
 	}
 }
 func TestGetAuditEvent_Success(t *testing.T) {
 	event := &domain.AuditEvent{
 		ID:           "ev-123",
@@ -78,7 +78,7 @@ type AuthBreakglassHandler struct {
 	// nil-safe: when unset, the handler skips the limiter check and
 	// relies on the service-layer Argon2id lockout. Production deploys
 	// MUST set this via SetLoginRateLimiter.
-	loginLimiter *ratelimit.SlidingWindowLimiter
+	loginLimiter ratelimit.Limiter
 }
 // NewAuthBreakglassHandler constructs the handler.
@@ -89,7 +89,7 @@ func NewAuthBreakglassHandler(svc BreakglassService, cookieAttrs SessionCookieAt
 // SetLoginRateLimiter wires the per-source-IP rate limiter the Login
 // handler enforces. Bundle 5 closure (S1) — see the AuthBreakglassHandler
 // type docstring for the full rationale.
-func (h *AuthBreakglassHandler) SetLoginRateLimiter(l *ratelimit.SlidingWindowLimiter) {
+func (h *AuthBreakglassHandler) SetLoginRateLimiter(l ratelimit.Limiter) {
 	h.loginLimiter = l
 }
@@ -0,0 +1,232 @@
 // Copyright 2026 certctl LLC. All rights reserved.
 // SPDX-License-Identifier: BUSL-1.1
 package handler
 import (
 	"context"
 	"encoding/base64"
 	"encoding/json"
 	"errors"
 	"fmt"
 	"strings"
 	"time"
 	gooidc "github.com/coreos/go-oidc/v3/oidc"
 	oidcsvc "github.com/certctl-io/certctl/internal/auth/oidc"
 	oidcdomain "github.com/certctl-io/certctl/internal/auth/oidc/domain"
 	"github.com/certctl-io/certctl/internal/repository"
 )
 // Phase 9 ARCH-M2 closure Sprint 11 (2026-05-14): extracted from
 // internal/api/handler/auth_session_oidc.go via the Option B
 // sibling-file pattern.
 //
 // This file holds the DefaultBCLVerifier — the default
 // implementation of the BackChannelLogoutVerifier interface
 // declared in auth_session_oidc.go. Verifies an OIDC
 // back-channel-logout token per OpenID Connect Back-Channel
 // Logout 1.0 §2.6: enforces the events claim, iat window,
 // algorithm allowlist, audience match against the provider's
 // configured client ID, and decodes sub/sid/jti for the
 // revocation lookup.
 //
 // External callers:
 //   - cmd/server/main.go wires NewDefaultBCLVerifier(...) +
 //     DefaultBCLVerifierMaxAge into the AuthSessionOIDCHandler
 //     via WithBCLReplayConsumer.
 //
 // peekIssuer (unexported) is consumed only by Verify so it moves
 // with the verifier. The go-oidc/v3 client is the underlying JWS
 // verification + IdP-key-cache; everything else here is policy.
 // =============================================================================
 // Default BackChannelLogoutVerifier — wraps go-oidc/v3.
 // =============================================================================
 // DefaultBCLVerifierMaxAge is the default iat-freshness skew window
 // (60 seconds; tokens whose iat lies further than this in the past or
 // the future are rejected). Override per-server via
 // CERTCTL_OIDC_BCL_MAX_AGE_SECONDS. Audit 2026-05-10 HIGH-3 closure.
 const DefaultBCLVerifierMaxAge = 60 * time.Second
 // DefaultBCLVerifier is the production BackChannelLogoutVerifier. It
 // resolves the IdP by issuer (matched against the OIDCProviderRepository),
 // fetches the IdP's JWKS via gooidc.Provider, and validates the
 // logout_token JWT signature + required claims.
 type DefaultBCLVerifier struct {
 	providerRepo repository.OIDCProviderRepository
 	tenantID     string
 	allowedAlgs  []string
 	// maxAge is the iat-freshness skew window. Tokens with iat in the
 	// past beyond this OR in the future beyond this are rejected. Set
 	// via WithMaxAge; defaults to DefaultBCLVerifierMaxAge.
 	maxAge time.Duration
 	// nowFn is the clock seam (test injection).
 	nowFn func() time.Time
 	// Injectable for tests so unit tests don't hit a real IdP.
 	verifyOverride func(ctx context.Context, providerIssuer, rawIDToken string) (*gooidc.IDToken, error)
 }
 // NewDefaultBCLVerifier constructs a verifier wired against the given
 // provider repo + tenant.
 func NewDefaultBCLVerifier(providerRepo repository.OIDCProviderRepository, tenantID string, allowedAlgs []string) *DefaultBCLVerifier {
 	if len(allowedAlgs) == 0 {
 		allowedAlgs = []string{
 			gooidc.RS256, gooidc.RS512, gooidc.ES256, gooidc.ES384, gooidc.EdDSA,
 		}
 	}
 	return &DefaultBCLVerifier{
 		providerRepo: providerRepo,
 		tenantID:     tenantID,
 		allowedAlgs:  allowedAlgs,
 		maxAge:       DefaultBCLVerifierMaxAge,
 		nowFn:        time.Now,
 	}
 }
 // WithMaxAge overrides the iat-skew window on the receiver in place
 // and returns the same verifier for chaining (it does not copy).
 // Audit 2026-05-10 HIGH-3: operator-configurable via
 // CERTCTL_OIDC_BCL_MAX_AGE_SECONDS at cmd/server/main.go.
 func (v *DefaultBCLVerifier) WithMaxAge(d time.Duration) *DefaultBCLVerifier {
 	v.maxAge = d
 	return v
 }
 // Verify implements BackChannelLogoutVerifier.
 func (v *DefaultBCLVerifier) Verify(ctx context.Context, logoutToken string) (issuer, sub, sid, jti string, iat int64, err error) {
 	// We don't know which provider the logout_token came from until we
 	// peek at the iss claim. Parse-without-verify, look up the matching
 	// provider, then verify against that provider's JWKS.
 	iss, peekErr := peekIssuer(logoutToken)
 	if peekErr != nil {
 		return "", "", "", "", 0, fmt.Errorf("peek issuer: %w", peekErr)
 	}
 	provs, lerr := v.providerRepo.List(ctx, v.tenantID)
 	if lerr != nil {
 		return "", "", "", "", 0, fmt.Errorf("list providers: %w", lerr)
 	}
 	var matched *oidcdomain.OIDCProvider
 	for _, p := range provs {
 		if p.IssuerURL == iss {
 			matched = p
 			break
 		}
 	}
 	if matched == nil {
 		return "", "", "", "", 0, fmt.Errorf("no provider configured for issuer %q", iss)
 	}
 	var idToken *gooidc.IDToken
 	if v.verifyOverride != nil {
 		idToken, err = v.verifyOverride(ctx, matched.IssuerURL, logoutToken)
 	} else {
 		// Acquisition-audit SEC-021 closure (Sprint 1 follow-up to SEC-001,
 		// 2026-05-16). Per-request discovery re-fetch threaded through
 		// SafeOIDCContext so the dial-time SSRF guard
 		// (validation.SafeHTTPDialContext) re-resolves the issuer host and
 		// refuses reserved-address answers — matching the SEC-001 sweep
 		// over the runtime + dry-run discovery legs in internal/auth/oidc.
 		provider, perr := gooidc.NewProvider(oidcsvc.SafeOIDCContext(ctx), matched.IssuerURL)
 		if perr != nil {
 			return "", "", "", "", 0, fmt.Errorf("provider discovery: %w", perr)
 		}
 		verifier := provider.Verifier(&gooidc.Config{
 			ClientID:             matched.ClientID,
 			SupportedSigningAlgs: v.allowedAlgs,
 			SkipExpiryCheck:      true, // OIDC BCL §2.4 — no exp claim required
 		})
 		idToken, err = verifier.Verify(ctx, logoutToken)
 	}
 	if err != nil {
 		return "", "", "", "", 0, fmt.Errorf("verify: %w", err)
 	}
 	// Required claims per spec §2.4.
 	var claims struct {
 		Iss    string                 `json:"iss"`
 		Aud    interface{}            `json:"aud"`
 		Iat    int64                  `json:"iat"`
 		Jti    string                 `json:"jti"`
 		Events map[string]interface{} `json:"events"`
 		Sub    string                 `json:"sub"`
 		Sid    string                 `json:"sid"`
 		Nonce  string                 `json:"nonce"`
 	}
 	if cerr := idToken.Claims(&claims); cerr != nil {
 		return "", "", "", "", 0, fmt.Errorf("claims unmarshal: %w", cerr)
 	}
 	if claims.Iat == 0 {
 		return "", "", "", "", 0, errors.New("missing iat claim")
 	}
 	// Audit 2026-05-10 HIGH-3 — iat freshness check. Reject tokens
 	// whose iat is outside the skew window. RFC 9700 §2.7 + the
 	// existing ID-token-path skew tolerance (oidc/service.go:463).
 	maxAge := v.maxAge
 	if maxAge <= 0 {
 		maxAge = DefaultBCLVerifierMaxAge
 	}
 	now := v.nowFn().UTC()
 	iatTime := time.Unix(claims.Iat, 0).UTC()
 	if iatTime.After(now.Add(maxAge)) {
 		return "", "", "", "", 0, fmt.Errorf("iat is in the future beyond max-age %s", maxAge)
 	}
 	if now.Sub(iatTime) > maxAge {
 		return "", "", "", "", 0, fmt.Errorf("iat is stale (age %s > max-age %s)", now.Sub(iatTime), maxAge)
 	}
 	if claims.Jti == "" {
 		return "", "", "", "", 0, errors.New("missing jti claim")
 	}
 	if claims.Events == nil {
 		return "", "", "", "", 0, errors.New("missing events claim")
 	}
 	if _, ok := claims.Events["http://schemas.openid.net/event/backchannel-logout"]; !ok {
 		return "", "", "", "", 0, errors.New("events claim missing back-channel-logout URI")
 	}
 	if claims.Nonce != "" {
 		// Spec §2.4: nonce MUST NOT be present.
 		return "", "", "", "", 0, errors.New("nonce claim must be absent in logout_token")
 	}
 	if claims.Sub == "" && claims.Sid == "" {
 		return "", "", "", "", 0, errors.New("logout_token must carry sub or sid")
 	}
 	return claims.Iss, claims.Sub, claims.Sid, claims.Jti, claims.Iat, nil
 }
 // peekIssuer base64url-decodes the JWT payload (the second of the
 // three dot-separated segments) and extracts the `iss` claim without
 // verifying the signature. Verify uses it to route the logout_token
 // to the matching provider before it knows which JWKS to verify
 // against.
 //
 // Audit 2026-05-10 Nit-3 — peekIssuer is INTENTIONALLY unsigned-permissive.
 // The returned issuer is used ONLY to select the verifier; the full
 // signature + claim verification happens in DefaultBCLVerifier.Verify
 // (which re-checks the `iss` claim against the matched provider's
 // IssuerURL after JWS signature validation). Callers MUST NOT trust
 // peekIssuer output for any access-control decision before the verify
 // step completes; the pin is encoded in the BCL handler's call shape
 // (peek → match provider → verify-against-provider → consume).
 func peekIssuer(jwt string) (string, error) {
 	parts := strings.Split(jwt, ".")
 	if len(parts) != 3 {
 		return "", errors.New("expected 3 JWT segments")
 	}
 	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
 	if err != nil {
 		return "", fmt.Errorf("payload base64: %w", err)
 	}
 	var c struct {
 		Iss string `json:"iss"`
 	}
 	if jerr := json.Unmarshal(payload, &c); jerr != nil {
 		return "", fmt.Errorf("payload json: %w", jerr)
 	}
 	if c.Iss == "" {
 		return "", errors.New("missing iss claim in payload")
 	}
 	return c.Iss, nil
 }
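The two-sided iat window that Verify enforces can be exercised in isolation. The following is a minimal sketch, not code from this commit: `checkIatFreshness`, `demoNow`, and `demoMaxAge` are hypothetical names introduced for illustration, and the comparisons are copied from the checks in Verify above.

```go
package main

import (
	"fmt"
	"time"
)

// demoNow and demoMaxAge are illustrative fixtures (assumptions, not
// values taken from the repository beyond the 60-second default).
var demoNow = time.Date(2026, 5, 16, 12, 0, 0, 0, time.UTC)

const demoMaxAge = 60 * time.Second

// checkIatFreshness is a hypothetical standalone helper mirroring the
// window check in Verify: an iat more than maxAge in the future is
// rejected, and an iat older than maxAge is rejected. Both boundaries
// are inclusive (exactly maxAge away is still accepted).
func checkIatFreshness(now time.Time, iat int64, maxAge time.Duration) error {
	iatTime := time.Unix(iat, 0).UTC()
	if iatTime.After(now.Add(maxAge)) {
		return fmt.Errorf("iat is in the future beyond max-age %s", maxAge)
	}
	if age := now.Sub(iatTime); age > maxAge {
		return fmt.Errorf("iat is stale (age %s > max-age %s)", age, maxAge)
	}
	return nil
}

func main() {
	// Probe offsets on either side of both window edges.
	for _, offset := range []int64{-61, -60, 0, 60, 61} {
		err := checkIatFreshness(demoNow, demoNow.Unix()+offset, demoMaxAge)
		fmt.Printf("iat offset %+ds: err=%v\n", offset, err)
	}
}
```

Passing the clock in as a parameter plays the same role as the `nowFn` seam on DefaultBCLVerifier: the edges of the window can be tested without sleeping.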
@@ -0,0 +1,77 @@
 // Copyright 2026 certctl LLC. All rights reserved.
 // SPDX-License-Identifier: BUSL-1.1
 package handler
 import (
 	"context"
 	"encoding/base64"
 	"strings"
 	"testing"
 	oidcdomain "github.com/certctl-io/certctl/internal/auth/oidc/domain"
 )
 // Acquisition-audit SEC-021 closure (Sprint 1 follow-up to SEC-001,
 // 2026-05-16). DefaultBCLVerifier.Verify performs a per-request
 // discovery re-fetch via gooidc.NewProvider(ctx, matched.IssuerURL).
 // Pre-fix, the bare ctx fell through to http.DefaultClient at the dial
 // layer — no SSRF guard, no DNS-rebinding re-resolve. The fix wraps
 // ctx via oidcsvc.SafeOIDCContext so the dial-time
 // validation.SafeHTTPDialContext refuses reserved-address answers
 // (loopback / link-local / cloud-metadata).
 //
 // This test pins the wrap end-to-end:
 //
 //  1. Construct a stubProviderRepo with one provider whose IssuerURL is
 //     a literal-loopback http:// URL (the literal-IP class that
 //     SafeHTTPDialContext.isReservedIPForDial refuses up-front, before
 //     any DNS resolution attempt).
 //  2. Hand-roll a 3-segment JWT whose payload base64url-decodes to
 //     {"iss":"<loopback url>"} so peekIssuer extracts the matching
 //     issuer and provs.List() returns the seeded provider.
 //  3. Call Verify. The discovery NewProvider call now routes through
 //     SafeOIDCContext; SafeHTTPDialContext sees the literal 127.0.0.1
 //     and refuses with "refusing to dial reserved address <ip>".
 //  4. Assert the returned error wraps that rejection (substring match
 //     on "refusing to dial" / "reserved address") rather than a
 //     generic connect-refused or "did not respond" wrap.
 //
 // Companion to TestFetchUserinfoGroups_SSRF_BlocksReservedAddress in
 // internal/auth/oidc/service_test.go which exercises the same wrap on
 // the userinfo-fallback leg. Together they pin the post-SEC-001 sweep.
 func TestDefaultBCLVerifier_SSRF_BlocksReservedAddress(t *testing.T) {
 	// Literal-loopback issuer URL. Port :1 keeps the URL syntactically
 	// valid; SafeHTTPDialContext refuses on the literal-IP check before
 	// the dial-time TCP connect, so the port choice is moot.
 	const reservedIssuer = "http://127.0.0.1:1"
 	provs := &stubProviderRepo{
 		provs: []*oidcdomain.OIDCProvider{
 			{ID: "op-loopback", IssuerURL: reservedIssuer, ClientID: "test-client"},
 		},
 	}
 	v := NewDefaultBCLVerifier(provs, "t-default", nil)
 	// Hand-roll the JWT. peekIssuer (see auth_session_oidc_bcl.go) parses
 	// only the iss claim from the 2nd segment (payload), so the header +
 	// signature segments only need to be syntactically present.
 	header := base64.RawURLEncoding.EncodeToString([]byte(`{"alg":"RS256"}`))
 	payload := base64.RawURLEncoding.EncodeToString([]byte(`{"iss":"` + reservedIssuer + `"}`))
 	logoutToken := header + "." + payload + ".sig"
 	_, _, _, _, _, err := v.Verify(context.Background(), logoutToken)
 	if err == nil {
 		t.Fatal("Verify against literal-loopback issuer URL: expected SSRF reject; got nil")
 	}
 	msg := err.Error()
 	if !strings.Contains(msg, "refusing to dial") && !strings.Contains(msg, "reserved address") {
 		t.Errorf("Verify err = %q; want SafeHTTPDialContext reserved-address rejection", msg)
 	}
 	// Also confirm the error is wrapped through the Verify "provider
 	// discovery:" prefix so callers can distinguish a discovery-time
 	// dial failure from a signature-verification failure.
 	if !strings.Contains(msg, "provider discovery") {
 		t.Errorf("Verify err = %q; want \"provider discovery:\" wrap", msg)
 	}
 }
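For readers unfamiliar with dial-time SSRF guards, the behaviour this test pins can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's `validation.SafeHTTPDialContext`: `isReserved` and `checkDialAddress` are hypothetical names, and the exact set of address ranges the real guard refuses is not shown in this diff. The sketch hooks `net.Dialer.Control`, which runs after name resolution and before the connection is made, so it always sees a literal IP.

```go
package main

import (
	"fmt"
	"net"
	"syscall"
)

// isReserved reports whether ip is in a range an SSRF guard would
// commonly refuse. The list here is an assumption for illustration.
func isReserved(ip net.IP) bool {
	return ip.IsLoopback() ||
		ip.IsLinkLocalUnicast() || // includes 169.254.169.254 (cloud metadata)
		ip.IsLinkLocalMulticast() ||
		ip.IsPrivate() ||
		ip.IsUnspecified()
}

// checkDialAddress expects an already-resolved "ip:port" address, which
// is what net.Dialer.Control receives.
func checkDialAddress(address string) error {
	host, _, err := net.SplitHostPort(address)
	if err != nil {
		return fmt.Errorf("split host/port: %w", err)
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return fmt.Errorf("address %q is not a literal IP", address)
	}
	if isReserved(ip) {
		return fmt.Errorf("refusing to dial reserved address %s", ip)
	}
	return nil
}

func main() {
	dialer := &net.Dialer{
		// Control vetoes the connection before any packet is sent.
		Control: func(network, address string, _ syscall.RawConn) error {
			return checkDialAddress(address)
		},
	}
	conn, err := dialer.Dial("tcp", "127.0.0.1:1")
	if err == nil {
		conn.Close()
	}
	fmt.Println("dial result:", err)
}
```

Because the check runs per dial rather than once at configuration time, a hostname that later re-resolves to a reserved address is still refused, which is the DNS-rebinding property the comment above describes.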
@@ -0,0 +1,469 @@
 // Copyright 2026 certctl LLC. All rights reserved.
 // SPDX-License-Identifier: BUSL-1.1
 package handler
 import (
 	"context"
 	"encoding/json"
 	"errors"
 	"net/http"
 	"strings"
 	"time"
 	oidcsvc "github.com/certctl-io/certctl/internal/auth/oidc"
 	oidcdomain "github.com/certctl-io/certctl/internal/auth/oidc/domain"
 	"github.com/certctl-io/certctl/internal/repository"
 )
 // Phase 9 ARCH-M2 closure Sprint 11 (2026-05-14): extracted from
 // internal/api/handler/auth_session_oidc.go via the Option B
 // sibling-file pattern.
 //
 // This file holds Section 3 of the original three-section layout:
 // OIDC PROVIDER + GROUP-MAPPING CRUD (RBAC-gated). Nine
 // endpoints across two related resources:
 //
 //   GET    /api/v1/auth/oidc/providers            -> auth.oidc.list
 //   POST   /api/v1/auth/oidc/providers            -> auth.oidc.create
 //   PUT    /api/v1/auth/oidc/providers/{id}       -> auth.oidc.edit
 //   DELETE /api/v1/auth/oidc/providers/{id}       -> auth.oidc.delete
 //   POST   /api/v1/auth/oidc/providers/{id}/test  -> auth.oidc.edit
 //   POST   /api/v1/auth/oidc/providers/{id}/refresh -> auth.oidc.edit
 //   GET    /api/v1/auth/oidc/group-mappings       -> auth.oidc.list
 //   POST   /api/v1/auth/oidc/group-mappings       -> auth.oidc.edit
 //   DELETE /api/v1/auth/oidc/group-mappings/{id}  -> auth.oidc.edit
 //
 // The four request/response projection types (oidcProviderRequest,
 // oidcProviderResponse, groupMappingRequest, groupMappingResponse)
 // move with their handler callers. The encryptClientSecret +
 // recordAudit + randomB64URLForHandler + defaultIfBlank +
 // defaultIntIfZero helpers stay in auth_session_oidc.go — they're
 // also consumed elsewhere (recordAudit is used by every section)
 // or are generic utilities that don't have a single owner.
 //
 // NOTE: the audit's verb-based prescription (login / callback /
 // refresh / logout / backchannel) named "refresh" as a separate
 // sibling file. The RefreshProvider handler here is the only
 // "refresh" in this file, but operationally it's an ADMIN
 // operation on a provider's signing-key cache, not a session
 // refresh. Sprint 11 keeps it grouped with the rest of the
 // provider CRUD where it belongs by call-graph + permission scope
 // (auth.oidc.edit, the same RBAC permission as UpdateProvider).
 // =============================================================================
 // 3. OIDC provider + group-mapping CRUD.
 // =============================================================================
 type oidcProviderResponse struct {
 	ID                  string   `json:"id"`
 	TenantID            string   `json:"tenant_id"`
 	Name                string   `json:"name"`
 	IssuerURL           string   `json:"issuer_url"`
 	ClientID            string   `json:"client_id"`
 	RedirectURI         string   `json:"redirect_uri"`
 	GroupsClaimPath     string   `json:"groups_claim_path"`
 	GroupsClaimFormat   string   `json:"groups_claim_format"`
 	FetchUserinfo       bool     `json:"fetch_userinfo"`
 	Scopes              []string `json:"scopes"`
 	AllowedEmailDomains []string `json:"allowed_email_domains"`
 	IATWindowSeconds    int      `json:"iat_window_seconds"`
 	JWKSCacheTTLSeconds int      `json:"jwks_cache_ttl_seconds"`
 	CreatedAt           string   `json:"created_at"`
 	UpdatedAt           string   `json:"updated_at"`
 }
 func providerToResponse(p *oidcdomain.OIDCProvider) oidcProviderResponse {
 	return oidcProviderResponse{
 		ID: p.ID, TenantID: p.TenantID, Name: p.Name,
 		IssuerURL: p.IssuerURL, ClientID: p.ClientID, RedirectURI: p.RedirectURI,
 		GroupsClaimPath: p.GroupsClaimPath, GroupsClaimFormat: p.GroupsClaimFormat,
 		FetchUserinfo: p.FetchUserinfo, Scopes: p.Scopes, AllowedEmailDomains: p.AllowedEmailDomains,
 		IATWindowSeconds: p.IATWindowSeconds, JWKSCacheTTLSeconds: p.JWKSCacheTTLSeconds,
 		CreatedAt: p.CreatedAt.UTC().Format(time.RFC3339),
 		UpdatedAt: p.UpdatedAt.UTC().Format(time.RFC3339),
 	}
 }
 type oidcProviderRequest struct {
 	Name                string   `json:"name"`
 	IssuerURL           string   `json:"issuer_url"`
 	ClientID            string   `json:"client_id"`
 	ClientSecret        string   `json:"client_secret"` // plaintext on the wire ONLY at create/update; encrypted at rest
 	RedirectURI         string   `json:"redirect_uri"`
 	GroupsClaimPath     string   `json:"groups_claim_path"`
 	GroupsClaimFormat   string   `json:"groups_claim_format"`
 	FetchUserinfo       bool     `json:"fetch_userinfo"`
 	Scopes              []string `json:"scopes"`
 	AllowedEmailDomains []string `json:"allowed_email_domains"`
 	IATWindowSeconds    int      `json:"iat_window_seconds"`
 	JWKSCacheTTLSeconds int      `json:"jwks_cache_ttl_seconds"`
 }
 // ListProviders handles GET /api/v1/auth/oidc/providers.
 func (h *AuthSessionOIDCHandler) ListProviders(w http.ResponseWriter, r *http.Request) {
 	if _, err := callerFromRequest(r); err != nil {
 		writeAuthError(w, err)
 		return
 	}
 	provs, err := h.providerRepo.List(r.Context(), h.tenantID)
 	if err != nil {
 		Error(w, http.StatusInternalServerError, "could not list providers")
 		return
 	}
 	out := make([]oidcProviderResponse, 0, len(provs))
 	for _, p := range provs {
 		out = append(out, providerToResponse(p))
 	}
 	writeJSON(w, http.StatusOK, map[string]interface{}{"providers": out})
 }
 // CreateProvider handles POST /api/v1/auth/oidc/providers.
 func (h *AuthSessionOIDCHandler) CreateProvider(w http.ResponseWriter, r *http.Request) {
 	caller, err := callerFromRequest(r)
 	if err != nil {
 		writeAuthError(w, err)
 		return
 	}
 	var req oidcProviderRequest
 	if derr := json.NewDecoder(r.Body).Decode(&req); derr != nil {
 		Error(w, http.StatusBadRequest, "invalid JSON body")
 		return
 	}
 	if strings.TrimSpace(req.ClientSecret) == "" {
 		Error(w, http.StatusBadRequest, "client_secret is required")
 		return
 	}
 	encrypted, eerr := h.encryptClientSecret([]byte(req.ClientSecret))
 	if eerr != nil {
 		Error(w, http.StatusInternalServerError, "could not encrypt client secret")
 		return
 	}
 	prov := &oidcdomain.OIDCProvider{
 		ID:                    "op-" + randomB64URLForHandler(16),
 		TenantID:              h.tenantID,
 		Name:                  req.Name,
 		IssuerURL:             req.IssuerURL,
 		ClientID:              req.ClientID,
 		ClientSecretEncrypted: encrypted,
 		RedirectURI:           req.RedirectURI,
 		GroupsClaimPath:       defaultIfBlank(req.GroupsClaimPath, oidcdomain.DefaultGroupsClaimPath),
 		GroupsClaimFormat:     defaultIfBlank(req.GroupsClaimFormat, oidcdomain.GroupsClaimFormatStringArray),
 		FetchUserinfo:         req.FetchUserinfo,
 		Scopes:                req.Scopes,
 		AllowedEmailDomains:   req.AllowedEmailDomains,
 		IATWindowSeconds:      defaultIntIfZero(req.IATWindowSeconds, oidcdomain.DefaultIATWindowSeconds),
 		JWKSCacheTTLSeconds:   defaultIntIfZero(req.JWKSCacheTTLSeconds, oidcdomain.DefaultJWKSCacheTTLSeconds),
 	}
 	if verr := prov.Validate(); verr != nil {
 		Error(w, http.StatusBadRequest, verr.Error())
 		return
 	}
 	if cerr := h.providerRepo.Create(r.Context(), prov); cerr != nil {
 		if errors.Is(cerr, repository.ErrOIDCProviderDuplicateName) {
 			Error(w, http.StatusConflict, "provider name already exists")
 			return
 		}
 		Error(w, http.StatusInternalServerError, "could not create provider")
 		return
 	}
 	h.recordAudit(r.Context(), "auth.oidc_provider_created", caller.ActorID, caller.ActorType, prov.ID,
 		map[string]interface{}{"provider_id": prov.ID, "name": prov.Name, "issuer_url": prov.IssuerURL})
 	writeJSON(w, http.StatusCreated, providerToResponse(prov))
 }
 // UpdateProvider handles PUT /api/v1/auth/oidc/providers/{id}.
 func (h *AuthSessionOIDCHandler) UpdateProvider(w http.ResponseWriter, r *http.Request) {
 	caller, err := callerFromRequest(r)
 	if err != nil {
 		writeAuthError(w, err)
 		return
 	}
 	id := r.PathValue("id")
 	if id == "" {
 		Error(w, http.StatusBadRequest, "missing provider id")
 		return
 	}
 	existing, gerr := h.providerRepo.Get(r.Context(), id)
 	if gerr != nil {
 		if errors.Is(gerr, repository.ErrOIDCProviderNotFound) {
 			Error(w, http.StatusNotFound, "provider not found")
 			return
 		}
 		Error(w, http.StatusInternalServerError, "could not load provider")
 		return
 	}
 	var req oidcProviderRequest
 	if derr := json.NewDecoder(r.Body).Decode(&req); derr != nil {
 		Error(w, http.StatusBadRequest, "invalid JSON body")
 		return
 	}
 	// Mutable fields only (id / tenant_id / created_at preserved).
 	existing.Name = req.Name
 	existing.IssuerURL = req.IssuerURL
 	existing.ClientID = req.ClientID
 	existing.RedirectURI = req.RedirectURI
 	existing.GroupsClaimPath = defaultIfBlank(req.GroupsClaimPath, existing.GroupsClaimPath)
 	existing.GroupsClaimFormat = defaultIfBlank(req.GroupsClaimFormat, existing.GroupsClaimFormat)
 	existing.FetchUserinfo = req.FetchUserinfo
 	existing.Scopes = req.Scopes
 	existing.AllowedEmailDomains = req.AllowedEmailDomains
 	if req.IATWindowSeconds != 0 {
 		existing.IATWindowSeconds = req.IATWindowSeconds
 	}
 	if req.JWKSCacheTTLSeconds != 0 {
 		existing.JWKSCacheTTLSeconds = req.JWKSCacheTTLSeconds
 	}
 	// Re-encrypt client_secret only if a new one is supplied; empty
 	// preserves the existing ciphertext.
 	if strings.TrimSpace(req.ClientSecret) != "" {
 		encrypted, eerr := h.encryptClientSecret([]byte(req.ClientSecret))
 		if eerr != nil {
 			Error(w, http.StatusInternalServerError, "could not encrypt client secret")
 			return
 		}
 		existing.ClientSecretEncrypted = encrypted
 	}
 	if verr := existing.Validate(); verr != nil {
 		Error(w, http.StatusBadRequest, verr.Error())
 		return
 	}
 	if uerr := h.providerRepo.Update(r.Context(), existing); uerr != nil {
 		Error(w, http.StatusInternalServerError, "could not update provider")
 		return
 	}
 	h.recordAudit(r.Context(), "auth.oidc_provider_updated", caller.ActorID, caller.ActorType, existing.ID,
 		map[string]interface{}{"provider_id": existing.ID, "name": existing.Name})
 	writeJSON(w, http.StatusOK, providerToResponse(existing))
 }
 // DeleteProvider handles DELETE /api/v1/auth/oidc/providers/{id}.
 // Refused when at least one user has authenticated via this provider.
 func (h *AuthSessionOIDCHandler) DeleteProvider(w http.ResponseWriter, r *http.Request) {
 	caller, err := callerFromRequest(r)
 	if err != nil {
 		writeAuthError(w, err)
 		return
 	}
 	id := r.PathValue("id")
 	if id == "" {
 		Error(w, http.StatusBadRequest, "missing provider id")
 		return
 	}
 	if derr := h.providerRepo.Delete(r.Context(), id); derr != nil {
 		switch {
 		case errors.Is(derr, repository.ErrOIDCProviderNotFound):
 			Error(w, http.StatusNotFound, "provider not found")
 		case errors.Is(derr, repository.ErrOIDCProviderInUse):
 			Error(w, http.StatusConflict, "provider has authenticated users; revoke all sessions before delete")
 		default:
 			Error(w, http.StatusInternalServerError, "could not delete provider")
 		}
 		return
 	}
 	h.recordAudit(r.Context(), "auth.oidc_provider_deleted", caller.ActorID, caller.ActorType, id,
 		map[string]interface{}{"provider_id": id})
 	w.WriteHeader(http.StatusNoContent)
 }
 // TestProvider handles POST /api/v1/auth/oidc/test.
 //
 // Audit 2026-05-10 MED-5 closure. Dry-run validator for an OIDC
 // provider config: runs OIDC discovery, the alg-downgrade defense,
 // the RFC 9207 iss-parameter detection, and a JWKS fetch — without
 // persisting anything. Body: `{issuer_url, client_id, scopes}`
 // (client_secret accepted but ignored — discovery + JWKS don't
 // require it). Response: TestDiscoveryResult; HTTP 200 even when
 // individual checks fail (the response Errors field carries them so
 // the GUI can render per-check status rows).
 //
 // Permission gate: `auth.oidc.create` (the operator is dry-running a
 // provider they're about to create; the lookup endpoints have their
 // own .list gate so this can't be used as a roundabout reconnaissance
 // vector beyond what those already permit).
 func (h *AuthSessionOIDCHandler) TestProvider(w http.ResponseWriter, r *http.Request) {
 	caller, err := callerFromRequest(r)
 	if err != nil {
 		writeAuthError(w, err)
 		return
 	}
 	var req struct {
 		IssuerURL    string   `json:"issuer_url"`
 		ClientID     string   `json:"client_id"`
 		ClientSecret string   `json:"client_secret"`
 		Scopes       []string `json:"scopes"`
 	}
 	if derr := json.NewDecoder(r.Body).Decode(&req); derr != nil {
 		Error(w, http.StatusBadRequest, "invalid JSON body")
 		return
 	}
 	if strings.TrimSpace(req.IssuerURL) == "" {
 		Error(w, http.StatusBadRequest, "issuer_url is required")
 		return
 	}
 	// Type-assert to a local single-method interface so we can reach
 	// the TestDiscovery method. The OIDCAuthHandshaker interface is
 	// intentionally narrow; rather than widening it (which would force
 	// every test stub to implement TestDiscovery) we probe for the
 	// capability at this single endpoint. Production code always
 	// supplies *oidcsvc.Service.
 	type discoveryTester interface {
 		TestDiscovery(ctx context.Context, issuerURL string) (*oidcsvc.TestDiscoveryResult, error)
 	}
 	tester, ok := h.oidcSvc.(discoveryTester)
 	if !ok {
 		Error(w, http.StatusInternalServerError, "OIDC service does not support discovery test")
 		return
 	}
 	res, terr := tester.TestDiscovery(r.Context(), strings.TrimSpace(req.IssuerURL))
 	if terr != nil {
 		Error(w, http.StatusInternalServerError, "discovery test execution failed")
 		return
 	}
 	h.recordAudit(r.Context(), "auth.oidc_provider_tested", caller.ActorID, caller.ActorType, "",
 		map[string]interface{}{
 			"issuer_url":          req.IssuerURL,
 			"discovery_succeeded": res.DiscoverySucceeded,
 			"jwks_reachable":      res.JWKSReachable,
 			"iss_param_supported": res.IssParamSupported,
 			"error_count":         len(res.Errors),
 		})
 	writeJSON(w, http.StatusOK, res)
 }
 // RefreshProvider handles POST /api/v1/auth/oidc/providers/{id}/refresh.
 // Forces re-fetch of the IdP discovery doc + JWKS, re-runs the IdP
 // downgrade-attack defense.
 func (h *AuthSessionOIDCHandler) RefreshProvider(w http.ResponseWriter, r *http.Request) {
 	caller, err := callerFromRequest(r)
 	if err != nil {
 		writeAuthError(w, err)
 		return
 	}
 	id := r.PathValue("id")
 	if id == "" {
 		Error(w, http.StatusBadRequest, "missing provider id")
 		return
 	}
 	if rerr := h.oidcSvc.RefreshKeys(r.Context(), id); rerr != nil {
 		if errors.Is(rerr, repository.ErrOIDCProviderNotFound) {
 			Error(w, http.StatusNotFound, "provider not found")
 			return
 		}
 		Error(w, http.StatusBadRequest, "refresh failed: "+rerr.Error())
 		return
 	}
 	h.recordAudit(r.Context(), "auth.oidc_provider_refreshed", caller.ActorID, caller.ActorType, id,
 		map[string]interface{}{"provider_id": id})
 	writeJSON(w, http.StatusOK, map[string]interface{}{"refreshed": true})
 }
 type groupMappingResponse struct {
 	ID         string `json:"id"`
 	ProviderID string `json:"provider_id"`
 	GroupName  string `json:"group_name"`
 	RoleID     string `json:"role_id"`
 	TenantID   string `json:"tenant_id"`
 	CreatedAt  string `json:"created_at"`
 }
 func mappingToResponse(m *oidcdomain.GroupRoleMapping) groupMappingResponse {
 	return groupMappingResponse{
 		ID: m.ID, ProviderID: m.ProviderID, GroupName: m.GroupName,
 		RoleID: m.RoleID, TenantID: m.TenantID,
 		CreatedAt: m.CreatedAt.UTC().Format(time.RFC3339),
 	}
 }
 type groupMappingRequest struct {
 	ProviderID string `json:"provider_id"`
 	GroupName  string `json:"group_name"`
 	RoleID     string `json:"role_id"`
 }
 // ListGroupMappings handles GET /api/v1/auth/oidc/group-mappings?provider_id=<id>.
 func (h *AuthSessionOIDCHandler) ListGroupMappings(w http.ResponseWriter, r *http.Request) {
 	if _, err := callerFromRequest(r); err != nil {
 		writeAuthError(w, err)
 		return
 	}
 	providerID := strings.TrimSpace(r.URL.Query().Get("provider_id"))
 	if providerID == "" {
 		Error(w, http.StatusBadRequest, "missing required query parameter `provider_id`")
 		return
 	}
 	mappings, lerr := h.mappingRepo.ListByProvider(r.Context(), providerID)
 	if lerr != nil {
 		Error(w, http.StatusInternalServerError, "could not list mappings")
 		return
 	}
 	out := make([]groupMappingResponse, 0, len(mappings))
 	for _, m := range mappings {
 		out = append(out, mappingToResponse(m))
 	}
 	writeJSON(w, http.StatusOK, map[string]interface{}{"mappings": out})
 }
 // AddGroupMapping handles POST /api/v1/auth/oidc/group-mappings.
 func (h *AuthSessionOIDCHandler) AddGroupMapping(w http.ResponseWriter, r *http.Request) {
 	caller, err := callerFromRequest(r)
 	if err != nil {
 		writeAuthError(w, err)
 		return
 	}
 	var req groupMappingRequest
 	if derr := json.NewDecoder(r.Body).Decode(&req); derr != nil {
 		Error(w, http.StatusBadRequest, "invalid JSON body")
 		return
 	}
 	mapping := &oidcdomain.GroupRoleMapping{
 		ID:         "grm-" + randomB64URLForHandler(16),
 		ProviderID: req.ProviderID,
 		GroupName:  req.GroupName,
 		RoleID:     req.RoleID,
 		TenantID:   h.tenantID,
 	}
 	if verr := mapping.Validate(); verr != nil {
 		Error(w, http.StatusBadRequest, verr.Error())
 		return
 	}
 	if aerr := h.mappingRepo.Add(r.Context(), mapping); aerr != nil {
 		if errors.Is(aerr, repository.ErrGroupRoleMappingDuplicate) {
 			Error(w, http.StatusConflict, "mapping already exists")
 			return
 		}
 		Error(w, http.StatusInternalServerError, "could not add mapping")
 		return
 	}
 	h.recordAudit(r.Context(), "auth.group_mapping_added", caller.ActorID, caller.ActorType, mapping.ID,
 		map[string]interface{}{
 			"mapping_id": mapping.ID, "provider_id": mapping.ProviderID,
 			"group_name": mapping.GroupName, "role_id": mapping.RoleID,
 		})
 	writeJSON(w, http.StatusCreated, mappingToResponse(mapping))
 }
 // RemoveGroupMapping handles DELETE /api/v1/auth/oidc/group-mappings/{id}.
 func (h *AuthSessionOIDCHandler) RemoveGroupMapping(w http.ResponseWriter, r *http.Request) {
 	caller, err := callerFromRequest(r)
 	if err != nil {
 		writeAuthError(w, err)
 		return
 	}
 	id := r.PathValue("id")
 	if id == "" {
 		Error(w, http.StatusBadRequest, "missing mapping id")
 		return
 	}
 	if rerr := h.mappingRepo.Remove(r.Context(), id); rerr != nil {
 		if errors.Is(rerr, repository.ErrGroupRoleMappingNotFound) {
 			Error(w, http.StatusNotFound, "mapping not found")
 			return
 		}
 		Error(w, http.StatusInternalServerError, "could not remove mapping")
 		return
 	}
 	h.recordAudit(r.Context(), "auth.group_mapping_removed", caller.ActorID, caller.ActorType, id,
 		map[string]interface{}{"mapping_id": id})
 	w.WriteHeader(http.StatusNoContent)
 }
@@ -0,0 +1,390 @@
 // Copyright 2026 certctl LLC. All rights reserved.
 // SPDX-License-Identifier: BUSL-1.1
 package handler
 import (
 	"errors"
 	"net/http"
 	"strings"
 	"time"
 	oidcdomain "github.com/certctl-io/certctl/internal/auth/oidc/domain"
 	sessionsvc "github.com/certctl-io/certctl/internal/auth/session"
 	sessiondomain "github.com/certctl-io/certctl/internal/auth/session/domain"
 	"github.com/certctl-io/certctl/internal/domain"
 	"github.com/certctl-io/certctl/internal/repository"
 )
 // Phase 9 ARCH-M2 closure Sprint 11 (2026-05-14): extracted from
 // internal/api/handler/auth_session_oidc.go via the Option B
 // sibling-file pattern. Package stays `handler`; every external
 // caller of `handler.AuthSessionOIDCHandler.{LoginInitiate,
 // LoginCallback, BackChannelLogout, Logout}` resolves the same
 // way — pure mechanical relocation. The router wiring in
 // internal/api/router/router.go is unaffected.
 //
 // This file holds Section 1 of the original file's three-section
 // layout (per its own package doc-comment): the PUBLIC OIDC
 // HANDSHAKE handlers. The first three endpoints are auth-exempt (they
 // run before the caller has a certctl-issued credential); Logout
 // resolves the caller via callerFromRequest:
 //
 //   GET  /auth/oidc/login?provider=<id>          -> 302 to IdP
 //   GET  /auth/oidc/callback?code=...&state=...  -> consume + mint
 //   POST /auth/oidc/back-channel-logout          -> IdP-initiated
 //   POST /auth/logout                            -> revoke caller's
 //
 // Helpers (h.clearPreLoginCookie / h.clearSessionCookies /
 // h.recordAudit / clientIPFromRequest / classifyOIDCFailure) stay
 // in auth_session_oidc.go alongside the AuthSessionOIDCHandler
 // struct + constructor — same-package resolution makes the calls
 // reach across the file boundary at zero compile-time cost.
 // =============================================================================
 // 1. Public OIDC handshake handlers.
 // =============================================================================
 // LoginInitiate handles GET /auth/oidc/login?provider=<id>.
 //
 // Generates state + nonce + PKCE-S256 verifier (in OIDCService),
 // persists the pre-login row, sets the certctl_oidc_pending cookie,
 // 302-redirects to the IdP authorization URL.
 func (h *AuthSessionOIDCHandler) LoginInitiate(w http.ResponseWriter, r *http.Request) {
 	providerID := strings.TrimSpace(r.URL.Query().Get("provider"))
 	if providerID == "" {
 		Error(w, http.StatusBadRequest, "missing required query parameter `provider`")
 		return
 	}
 	// Audit 2026-05-10 MED-16 — capture clientIP + UA at /auth/oidc/login
 	// so HandleCallback can reject a stolen pre-login cookie replayed
 	// from a different browser/source. clientIPFromRequest already
 	// honours the LOW-5 trusted-proxy gating; r.UserAgent() reads the
 	// header verbatim.
 	loginIP := clientIPFromRequest(r)
 	loginUA := r.UserAgent()
 	authURL, cookieValue, _, err := h.oidcSvc.HandleAuthRequest(r.Context(), providerID, loginIP, loginUA)
 	if err != nil {
 		// Provider not found is the most common case; map to 404.
 		if errors.Is(err, repository.ErrOIDCProviderNotFound) {
 			Error(w, http.StatusNotFound, "provider not found")
 			return
 		}
 		// Other errors (discovery fetch failure / IdP downgrade defense /
 		// crypto failure) are server-side; surface as 500 without
 		// leaking details.
 		Error(w, http.StatusInternalServerError, "could not initiate OIDC login")
 		return
 	}
 	http.SetCookie(w, &http.Cookie{
 		Name:  sessiondomain.PreLoginCookieName,
 		Value: cookieValue,
 		// Audit 2026-05-10 MED-14 — `__Host-` prefix requires Path=/.
 		// The cookie lives 10 minutes and is only ever consumed by the
 		// callback handler; the wider path scope is harmless.
 		Path:     "/",
 		MaxAge:   int((10 * time.Minute).Seconds()),
 		Secure:   h.cookieAttrs.Secure,
 		HttpOnly: true,
 		// Pre-login cookie MUST be SameSite=Lax (cannot be Strict
 		// because the IdP-initiated callback is a top-level navigation
 		// from a different origin per Phase 5 spec).
 		SameSite: http.SameSiteLaxMode,
 	})
 	http.Redirect(w, r, authURL, http.StatusFound)
 }
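
A minimal sketch of the pre-login cookie attributes that LoginInitiate sets, using only net/http and httptest. The cookie name below is a hypothetical stand-in for sessiondomain.PreLoginCookieName, whose real value is not shown in this diff. Browsers accept a `__Host-` cookie only when it is Secure, has Path=/, and carries no Domain attribute, which is why the sketch passes secure=true.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// preLoginCookieName is a hypothetical stand-in for
// sessiondomain.PreLoginCookieName; the real value is not in this diff.
const preLoginCookieName = "__Host-example_oidc_pending"

// newPreLoginCookie mirrors the attribute choices in LoginInitiate:
// Path=/, a 10-minute lifetime, HttpOnly, and SameSite=Lax.
func newPreLoginCookie(value string, secure bool) *http.Cookie {
	return &http.Cookie{
		Name:     preLoginCookieName,
		Value:    value,
		Path:     "/",
		MaxAge:   int((10 * time.Minute).Seconds()),
		Secure:   secure,
		HttpOnly: true,
		SameSite: http.SameSiteLaxMode,
	}
}

// setCookieHeader returns the Set-Cookie header net/http emits for c.
func setCookieHeader(c *http.Cookie) string {
	rec := httptest.NewRecorder()
	http.SetCookie(rec, c)
	return rec.Header().Get("Set-Cookie")
}

// hasAttr reports whether the Set-Cookie header contains attr.
func hasAttr(header, attr string) bool {
	return strings.Contains(header, attr)
}

func main() {
	h := setCookieHeader(newPreLoginCookie("opaque-value", true))
	fmt.Println(h)
	fmt.Println("lax:", hasAttr(h, "SameSite=Lax"), "max-age:", hasAttr(h, "Max-Age=600"))
}
```

Deriving MaxAge from a time.Duration keeps the 10-minute window readable and avoids a bare 600 in the handler.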
 // LoginCallback handles GET /auth/oidc/callback?code=...&state=....
 //
 // Reads the certctl_oidc_pending cookie, drives OIDCService.HandleCallback
 // (which parses + HMAC-verifies the cookie, runs the 11-step token
 // validation, group-claim resolution, role-mapping, user-upsert),
 // mints a post-login session via SessionService.Create, deletes the
 // pre-login cookie, sets the post-login cookie + CSRF token cookie,
 // and 302-redirects to h.postLoginURL.
 func (h *AuthSessionOIDCHandler) LoginCallback(w http.ResponseWriter, r *http.Request) {
 	q := r.URL.Query()
 	code := strings.TrimSpace(q.Get("code"))
 	state := strings.TrimSpace(q.Get("state"))
 	// Audit 2026-05-10 MED-17 — RFC 9207 iss URL parameter. NOT
 	// trimmed; preserved exactly as sent so the service-layer compare
 	// against the matched provider's IssuerURL is byte-strict. The IdP
 	// emits this only when advertised in its discovery doc; the
 	// service-layer check is a no-op otherwise.
 	callbackIss := q.Get("iss")
 	if code == "" || state == "" {
 		Error(w, http.StatusBadRequest, "missing code or state query parameter")
 		return
 	}
 	preLoginCookie, err := r.Cookie(sessiondomain.PreLoginCookieName)
 	if err != nil || preLoginCookie.Value == "" {
 		Error(w, http.StatusBadRequest, "missing pre-login cookie")
 		h.recordAudit(r.Context(), "auth.oidc_login_failed", "anonymous", domain.ActorTypeSystem, "",
 			map[string]interface{}{"failure_category": "missing_pre_login_cookie"})
 		return
 	}
 	clientIP := clientIPFromRequest(r)
 	userAgent := r.UserAgent()
 	res, err := h.oidcSvc.HandleCallback(r.Context(), preLoginCookie.Value, code, state, callbackIss, clientIP, userAgent)
 	if err != nil {
 		// Audit 2026-05-10 HIGH-7 — instead of a blank 400, redirect
 		// to /login?error=oidc_failed&reason=<category>. The LoginPage
 		// reads the query params and renders an operator-friendly
 		// alert. The audit row still carries the specific
 		// failure_category so server-side observability is unchanged.
 		category := classifyOIDCFailure(err)
 		h.recordAudit(r.Context(), "auth.oidc_login_failed", "anonymous", domain.ActorTypeSystem, "",
 			map[string]interface{}{"failure_category": category})
 		// Special-case unmapped groups so the audit row name distinguishes
 		// it from generic failures (operator-policy decision).
 		if category == "unmapped_groups" {
 			h.recordAudit(r.Context(), "auth.oidc_login_unmapped_groups", "anonymous", domain.ActorTypeSystem, "",
 				map[string]interface{}{})
 		}
 		// Always clear the pre-login cookie on failure.
 		h.clearPreLoginCookie(w)
 		// 302 to the login page; the reason categorizes the failure for
 		// the GUI to render. Keep the redirect target relative — the
 		// SPA serves /login.
 		http.Redirect(w, r, "/login?error=oidc_failed&reason="+category, http.StatusFound)
 		return
 	}
 	// res from the OIDC service already carries cookieValue + CSRFToken
 	// (the OIDC service wraps SessionService internally per Phase 3).
 	// We re-emit them via the standard Set-Cookie helper here so cookie
 	// attributes stay handler-controlled.
 	now := time.Now().UTC()
 	expires := now.Add(8 * time.Hour) // matches default SessionConfig.AbsoluteTimeout
 	http.SetCookie(w, &http.Cookie{
 		Name:     sessiondomain.PostLoginCookieName,
 		Value:    res.CookieValue,
 		Path:     "/",
 		Expires:  expires,
 		Secure:   h.cookieAttrs.Secure,
 		HttpOnly: true,
 		SameSite: h.cookieAttrs.SameSite,
 	})
 	http.SetCookie(w, &http.Cookie{
 		Name:     sessiondomain.CSRFCookieName,
 		Value:    res.CSRFToken,
 		Path:     "/",
 		Expires:  expires,
 		Secure:   h.cookieAttrs.Secure,
 		HttpOnly: false, // intentional — GUI must read this to echo header
 		SameSite: h.cookieAttrs.SameSite,
 	})
 	h.clearPreLoginCookie(w)
 	userID := ""
 	if res.User != nil {
 		userID = res.User.ID
 	}
 	h.recordAudit(r.Context(), "auth.oidc_login_succeeded", userID, domain.ActorTypeUser, userID,
 		map[string]interface{}{
 			"user_id":  userID,
 			"role_ids": res.RoleIDs,
 		})
 	h.recordAudit(r.Context(), "auth.session_created", userID, domain.ActorTypeUser, userID,
 		map[string]interface{}{"user_id": userID})
 	http.Redirect(w, r, h.postLoginURL, http.StatusFound)
 }
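
The failure path in LoginCallback concatenates the category straight into the redirect URL. That is safe as long as classifyOIDCFailure only returns fixed tokens, which this diff does not show. A hedged sketch of a more defensive variant follows; loginFailureURL is a hypothetical helper, not part of the codebase, and it builds the same relative target with url.Values so the reason is always escaped.

```go
package main

import (
	"fmt"
	"net/url"
)

// loginFailureURL (hypothetical helper) builds the relative redirect
// target used after a failed callback. url.Values.Encode escapes the
// reason, so an unexpected character in the category cannot inject
// extra query parameters.
func loginFailureURL(category string) string {
	q := url.Values{}
	q.Set("error", "oidc_failed")
	q.Set("reason", category)
	return "/login?" + q.Encode()
}

func main() {
	fmt.Println(loginFailureURL("unmapped_groups"))
	// prints "/login?error=oidc_failed&reason=unmapped_groups"
	fmt.Println(loginFailureURL("bad&x=1"))
	// prints "/login?error=oidc_failed&reason=bad%26x%3D1"
}
```

Encode sorts keys alphabetically, so `error` precedes `reason` and the output matches the literal the handler currently builds for well-formed categories.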
 // BackChannelLogout handles POST /auth/oidc/back-channel-logout.
 //
 // OpenID Connect Back-Channel Logout 1.0. The IdP POSTs a logout_token
 // JWT in the body (form-encoded `logout_token=<jwt>`); certctl validates
 // signature against the IdP's JWKS, validates required claims (iss, aud,
 // iat, jti, events; exactly one of sub or sid; nonce ABSENT), revokes
 // matching sessions, returns 200 with Cache-Control: no-store. Failure
 // modes return 400 per spec §2.6.
 func (h *AuthSessionOIDCHandler) BackChannelLogout(w http.ResponseWriter, r *http.Request) {
 	if err := r.ParseForm(); err != nil {
 		Error(w, http.StatusBadRequest, "could not parse form body")
 		return
 	}
 	logoutToken := strings.TrimSpace(r.FormValue("logout_token"))
 	if logoutToken == "" {
 		Error(w, http.StatusBadRequest, "missing logout_token in form body")
 		return
 	}
 	issuer, sub, sid, jti, _, err := h.bclVerifier.Verify(r.Context(), logoutToken)
 	if err != nil {
 		// Per spec §2.6 — uniform 400 on any validation failure. The
 		// audit row carries the specific reason; the wire stays uniform.
 		// iat-skew rejections (Audit 2026-05-10 HIGH-3 iat-window check)
 		// land here too — the reason string distinguishes them.
 		h.recordAudit(r.Context(), "auth.oidc_back_channel_logout_failed", "anonymous", domain.ActorTypeSystem, "",
 			map[string]interface{}{"failure_reason": err.Error()})
 		Error(w, http.StatusBadRequest, "logout_token validation failed")
 		return
 	}
 	// Audit 2026-05-10 HIGH-3 — jti consumed-set. Atomic single-use
 	// semantics via the postgres ON CONFLICT DO NOTHING path. On
 	// replay return 200 + audit outcome=jti_replayed (RFC 9700 §2.7).
 	// On transient repo error return 503 so the IdP follows its retry
 	// semantics. When the consumer is nil (test path / pre-fix
 	// deployments) the consume step is skipped.
 	if h.bclReplay != nil && jti != "" {
 		ttl := h.bclMaxAge * 2
 		if ttl < 24*time.Hour {
 			ttl = 24 * time.Hour
 		}
 		if cerr := h.bclReplay.ConsumeJTI(r.Context(), jti, issuer, ttl); cerr != nil {
 			if errors.Is(cerr, repository.ErrBCLJTIAlreadyConsumed) {
 				h.recordAudit(r.Context(), "auth.oidc_back_channel_logout", "anonymous", domain.ActorTypeSystem, sub,
 					map[string]interface{}{"issuer": issuer, "subject": sub, "jti": jti, "outcome": "jti_replayed"})
 				w.Header().Set("Cache-Control", "no-store")
 				w.WriteHeader(http.StatusOK)
 				return
 			}
 			// Transient — let the IdP retry.
 			h.recordAudit(r.Context(), "auth.oidc_back_channel_logout_failed", "anonymous", domain.ActorTypeSystem, sub,
 				map[string]interface{}{"issuer": issuer, "subject": sub, "jti": jti, "outcome": "jti_consume_failed", "err": cerr.Error()})
 			http.Error(w, "transient", http.StatusServiceUnavailable)
 			return
 		}
 	}
 	// Resolve target sessions:
 	//   - sub set: revoke ALL sessions for the actor (oidc_subject lookup).
 	//   - sid set: revoke the specific session_id.
 	if sid != "" {
 		// Revoke is idempotent at the repo layer, so rerr is unlikely.
 		// Audit the outcome either way and return 200 (the IdP
 		// shouldn't retry on our errors).
 		outcome := "revoked"
 		if rerr := h.sessionSvc.Revoke(r.Context(), sid); rerr != nil {
 			outcome = "revoke_failed"
 		}
 		h.recordAudit(r.Context(), "auth.oidc_back_channel_logout", "anonymous", domain.ActorTypeSystem, sid,
 			map[string]interface{}{"sub_or_sid": "sid", "issuer": issuer, "session_id": sid, "outcome": outcome})
 	} else if sub != "" {
 		// CRIT-2 closure of the 2026-05-10 audit. Pre-fix this branch called
 		// RevokeAllForActor(sub, "User") under the false assumption that
 		// the OIDC subject was used as the actor_id stem. In reality,
 		// internal/auth/oidc/service.go::upsertUser mints
 		// u.ID = "u-" + randomB64URL(16) and stores the OIDC subject in
 		// a separate column, so the pre-fix lookup never found a session
 		// row and the error was silently swallowed. BCL silently revoked
 		// nothing — CWE-613.
 		//
 		// The fix resolves the IdP-signed `iss` claim back to a provider
 		// row via providerRepo.List + IssuerURL filter, then resolves
 		// sub → user.ID via userRepo.GetByOIDCSubject, then revokes all
 		// sessions for that actor. Outcome categories audited:
 		//   - revoked            (happy path)
 		//   - issuer_unknown     (iss doesn't match any configured provider)
 		//   - user_unknown       (provider matched, but no user.id seeded for this subject)
 		//   - revoke_failed      (DB hiccup at the revoke step)
 		//   - provider_lookup_failed / user_lookup_failed → 503 (transient; IdP retries)
 		// All success-shaped outcomes return 200 + Cache-Control: no-store
 		// per OIDC BCL 1.0 §2.7. Transient errors return 503 so the IdP
 		// follows its own retry semantics.
 		providers, plerr := h.providerRepo.List(r.Context(), h.tenantID)
 		if plerr != nil {
 			h.recordAudit(r.Context(), "auth.oidc_back_channel_logout", "anonymous", domain.ActorTypeSystem, sub,
 				map[string]interface{}{"sub_or_sid": "sub", "issuer": issuer, "subject": sub, "outcome": "provider_lookup_failed"})
 			http.Error(w, "transient", http.StatusServiceUnavailable)
 			return
 		}
 		var matched *oidcdomain.OIDCProvider
 		for _, p := range providers {
 			if p.IssuerURL == issuer {
 				matched = p
 				break
 			}
 		}
 		if matched == nil {
 			h.recordAudit(r.Context(), "auth.oidc_back_channel_logout", "anonymous", domain.ActorTypeSystem, sub,
 				map[string]interface{}{"sub_or_sid": "sub", "issuer": issuer, "subject": sub, "outcome": "issuer_unknown"})
 			// Idempotent — return 200 per spec.
 			w.Header().Set("Cache-Control", "no-store")
 			w.WriteHeader(http.StatusOK)
 			return
 		}
 		user, uerr := h.userRepo.GetByOIDCSubject(r.Context(), matched.ID, sub)
 		if uerr != nil {
 			if errors.Is(uerr, repository.ErrUserNotFound) {
 				// Idempotent: nothing to revoke. IdP may BCL a user we
 				// never logged in. RFC compliance: still 200.
 				h.recordAudit(r.Context(), "auth.oidc_back_channel_logout", "anonymous", domain.ActorTypeSystem, sub,
 					map[string]interface{}{"sub_or_sid": "sub", "issuer": issuer, "subject": sub, "outcome": "user_unknown"})
 				w.Header().Set("Cache-Control", "no-store")
 				w.WriteHeader(http.StatusOK)
 				return
 			}
 			// Transient — let the IdP retry.
 			h.recordAudit(r.Context(), "auth.oidc_back_channel_logout", "anonymous", domain.ActorTypeSystem, sub,
 				map[string]interface{}{"sub_or_sid": "sub", "issuer": issuer, "subject": sub, "outcome": "user_lookup_failed"})
 			http.Error(w, "transient", http.StatusServiceUnavailable)
 			return
 		}
 		if rerr := h.sessionSvc.RevokeAllForActor(r.Context(), user.ID, string(domain.ActorTypeUser)); rerr != nil {
 			// Revoke failed — BCL is best-effort per §2.8; still 200,
 			// audit the failure.
 			h.recordAudit(r.Context(), "auth.oidc_back_channel_logout", user.ID, domain.ActorTypeUser, sub,
 				map[string]interface{}{"sub_or_sid": "sub", "issuer": issuer, "subject": sub, "outcome": "revoke_failed"})
 			w.Header().Set("Cache-Control", "no-store")
 			w.WriteHeader(http.StatusOK)
 			return
 		}
 		h.recordAudit(r.Context(), "auth.oidc_back_channel_logout", user.ID, domain.ActorTypeUser, sub,
 			map[string]interface{}{"sub_or_sid": "sub", "issuer": issuer, "subject": sub, "outcome": "revoked"})
 	}
 	// Per spec §2.7 — Cache-Control: no-store on success.
 	w.Header().Set("Cache-Control", "no-store")
 	w.WriteHeader(http.StatusOK)
 }
 // Logout handles POST /auth/logout. Revokes the caller's current
 // session. Permission: own session (any authenticated caller).
 func (h *AuthSessionOIDCHandler) Logout(w http.ResponseWriter, r *http.Request) {
 	caller, err := callerFromRequest(r)
 	if err != nil {
 		writeAuthError(w, err)
 		return
 	}
 	// Resolve the caller's session via the cookie -> Validate path.
 	sessionCookie, cerr := r.Cookie(sessiondomain.PostLoginCookieName)
 	if cerr != nil || sessionCookie.Value == "" {
 		// No cookie => nothing to revoke; treat as success (idempotent).
 		h.clearSessionCookies(w)
 		w.WriteHeader(http.StatusNoContent)
 		return
 	}
 	sess, verr := h.sessionSvc.Validate(r.Context(), sessionsvc.ValidateInput{
 		CookieValue: sessionCookie.Value,
 		ClientIP:    clientIPFromRequest(r),
 		UserAgent:   r.UserAgent(),
 	})
 	if verr != nil {
 		// Cookie is invalid; clear + 204 (idempotent).
 		h.clearSessionCookies(w)
 		w.WriteHeader(http.StatusNoContent)
 		return
 	}
 	if rerr := h.sessionSvc.Revoke(r.Context(), sess.ID); rerr != nil {
 		Error(w, http.StatusInternalServerError, "could not revoke session")
 		return
 	}
 	// Audit 2026-05-11 Fix 13 — HIGH-2 fourth call site. Rotate the CSRF
 	// token on the actor's remaining sessions so a token captured in
 	// this device's browser pre-logout (DevTools, malicious extension,
 	// session-storage leak) can't be replayed against a sibling session
 	// (other browser, other device) after the user logged out here.
 	// The just-revoked session also rotates but its CSRF lookup will
 	// fail at the sessions table's revoked_at IS NOT NULL filter
 	// anyway; rotation on the revoked row is harmless. RotateCSRFTokenForActor
 	// returns the count rotated and NEVER errors — rotation is defense
 	// in depth and must not block the logout success.
 	rotated := h.sessionSvc.RotateCSRFTokenForActor(r.Context(), caller.ActorID, string(caller.ActorType))
 	h.recordAudit(r.Context(), "auth.session_revoked", caller.ActorID, caller.ActorType, sess.ID,
 		map[string]interface{}{"session_id": sess.ID, "self_initiated": true, "csrf_rotated": rotated})
 	h.clearSessionCookies(w)
 	w.WriteHeader(http.StatusNoContent)
 }
@@ -0,0 +1,207 @@
 // Copyright 2026 certctl LLC. All rights reserved.
 // SPDX-License-Identifier: BUSL-1.1
 package handler
 import (
 	"errors"
 	"net/http"
 	"time"
 	sessionsvc "github.com/certctl-io/certctl/internal/auth/session"
 	sessiondomain "github.com/certctl-io/certctl/internal/auth/session/domain"
 	"github.com/certctl-io/certctl/internal/repository"
 )
 // Phase 9 ARCH-M2 closure Sprint 11 (2026-05-14): extracted from
 // internal/api/handler/auth_session_oidc.go via the Option B
 // sibling-file pattern.
 //
 // This file holds Section 2 of the original three-section layout:
 // the SESSION MANAGEMENT handlers (RBAC-gated). Three endpoints:
 //
 //   GET    /api/v1/auth/sessions              -> list (own / all-actors)
 //   DELETE /api/v1/auth/sessions/{id}         -> revoke (own / any)
 //   DELETE /api/v1/auth/sessions?except=current
 //                                             -> revoke-all-except-current
 //
 // The sessionResponse projection type lives here alongside its
 // callers (sessionToResponse + the three handler methods). It's
 // the shape the API renders externally; no external caller relies
 // on its exact file location.
 // =============================================================================
 // 2. Session management handlers (RBAC-gated).
 // =============================================================================
 type sessionResponse struct {
 	ID                string `json:"id"`
 	ActorID           string `json:"actor_id"`
 	ActorType         string `json:"actor_type"`
 	IPAddress         string `json:"ip_address,omitempty"`
 	UserAgent         string `json:"user_agent,omitempty"`
 	CreatedAt         string `json:"created_at"`
 	LastSeenAt        string `json:"last_seen_at"`
 	IdleExpiresAt     string `json:"idle_expires_at"`
 	AbsoluteExpiresAt string `json:"absolute_expires_at"`
 	Revoked           bool   `json:"revoked"`
 }
 func sessionToResponse(s *sessiondomain.Session) sessionResponse {
 	return sessionResponse{
 		ID:                s.ID,
 		ActorID:           s.ActorID,
 		ActorType:         s.ActorType,
 		IPAddress:         s.IPAddress,
 		UserAgent:         s.UserAgent,
 		CreatedAt:         s.CreatedAt.UTC().Format(time.RFC3339),
 		LastSeenAt:        s.LastSeenAt.UTC().Format(time.RFC3339),
 		IdleExpiresAt:     s.IdleExpiresAt.UTC().Format(time.RFC3339),
 		AbsoluteExpiresAt: s.AbsoluteExpiresAt.UTC().Format(time.RFC3339),
 		Revoked:           s.RevokedAt != nil,
 	}
 }
 // ListSessions handles GET /api/v1/auth/sessions.
 //
 // Default behavior: list the current actor's sessions. With
 // ?actor_id=<other> + auth.session.list.all permission: list that
 // actor's sessions. The router's rbacGate enforces only the
 // auth.session.list floor; the all-actors check runs in the handler
 // because the gate cannot see the query parameter.
 func (h *AuthSessionOIDCHandler) ListSessions(w http.ResponseWriter, r *http.Request) {
 	caller, err := callerFromRequest(r)
 	if err != nil {
 		writeAuthError(w, err)
 		return
 	}
 	// Default to the caller's own sessions.
 	actorID := caller.ActorID
 	actorType := string(caller.ActorType)
 	if q := r.URL.Query().Get("actor_id"); q != "" && q != actorID {
 		// Audit 2026-05-10 MED-2 closure — listing a different
 		// actor's sessions requires the narrower auth.session.list.all
 		// permission. The router gate already enforced
 		// auth.session.list (the floor for any session-list call),
 		// but the all-actors variant is an admin-class capability and
 		// must be checked separately because the rbacGate can't see
 		// the query param. When the handler is wired with
 		// WithPermissionChecker (production), we re-check inline; when
 		// it isn't (legacy tests), the router gate's auth.session.list
 		// floor is the only check.
 		if h.checker != nil {
 			ok, perr := h.checker.CheckPermission(r.Context(),
 				caller.ActorID, string(caller.ActorType), h.tenantID,
 				"auth.session.list.all", "global", nil)
 			if perr != nil {
 				Error(w, http.StatusInternalServerError, "permission check failed")
 				return
 			}
 			if !ok {
 				Error(w, http.StatusForbidden, "auth.session.list.all required to list another actor's sessions")
 				return
 			}
 		}
 		actorID = q
 		if at := r.URL.Query().Get("actor_type"); at != "" {
 			actorType = at
 		}
 	}
 	sessions, lerr := h.sessionRepo.ListByActor(r.Context(), actorID, actorType, h.tenantID)
 	if lerr != nil {
 		Error(w, http.StatusInternalServerError, "could not list sessions")
 		return
 	}
 	out := make([]sessionResponse, 0, len(sessions))
 	for _, s := range sessions {
 		out = append(out, sessionToResponse(s))
 	}
 	writeJSON(w, http.StatusOK, map[string]interface{}{"sessions": out})
 }
 // RevokeSession handles DELETE /api/v1/auth/sessions/{id}.
 func (h *AuthSessionOIDCHandler) RevokeSession(w http.ResponseWriter, r *http.Request) {
 	caller, err := callerFromRequest(r)
 	if err != nil {
 		writeAuthError(w, err)
 		return
 	}
 	sessionID := r.PathValue("id")
 	if sessionID == "" {
 		Error(w, http.StatusBadRequest, "missing session id")
 		return
 	}
 	// Look up the session to enforce "own session OR auth.session.revoke".
 	sess, gerr := h.sessionRepo.Get(r.Context(), sessionID)
 	if gerr != nil {
 		if errors.Is(gerr, repository.ErrSessionNotFound) {
 			Error(w, http.StatusNotFound, "session not found")
 			return
 		}
 		Error(w, http.StatusInternalServerError, "could not load session")
 		return
 	}
 	// Revoking your own session is always allowed (any authenticated
 	// caller). Revoking someone else's session requires the
 	// auth.session.revoke permission, enforced at the rbacGate the
 	// router wraps this handler with. The handler adds no ownership
 	// branch of its own; sess is loaded for the 404 check and for the
 	// audit row's target_actor_id.
 	if rerr := h.sessionSvc.Revoke(r.Context(), sessionID); rerr != nil {
 		Error(w, http.StatusInternalServerError, "could not revoke session")
 		return
 	}
 	h.recordAudit(r.Context(), "auth.session_revoked", caller.ActorID, caller.ActorType, sessionID,
 		map[string]interface{}{"session_id": sessionID, "target_actor_id": sess.ActorID})
 	w.WriteHeader(http.StatusNoContent)
 }
 // RevokeAllExceptCurrent handles DELETE /api/v1/auth/sessions?except=current.
 //
 // Audit 2026-05-10 MED-3 closure — backs the "Sign out all other
 // sessions" SessionsPage button. Revokes every active session for the
 // caller EXCEPT the session that issued the current request (so the
 // user doesn't get logged out by the action they just took).
 //
 // The current session ID is read from the request's session cookie via
 // the SessionMiddleware's actor context — for Bearer-mode callers this
 // is the empty string and ALL the actor's sessions are revoked (matches
 // the "log me out everywhere" semantic for API-key-mode users).
 //
 // Audit row records the count for compliance (one summary row per
 // invocation; per-session detail is implicit in the count + actor).
 func (h *AuthSessionOIDCHandler) RevokeAllExceptCurrent(w http.ResponseWriter, r *http.Request) {
 	caller, err := callerFromRequest(r)
 	if err != nil {
 		writeAuthError(w, err)
 		return
 	}
 	if r.URL.Query().Get("except") != "current" {
 		Error(w, http.StatusBadRequest, "only ?except=current is supported")
 		return
 	}
 	// Current session ID — empty for Bearer/API-key callers (acceptable;
 	// the repo's RevokeAllExceptForActor handles "" by revoking
 	// literally every active session). Read from the session middleware's
 	// SessionFromContext helper which populates the validated session
 	// on the request context for cookie-mode callers.
 	currentSessionID := ""
 	if sess := sessionsvc.SessionFromContext(r.Context()); sess != nil {
 		currentSessionID = sess.ID
 	}
 	count, rerr := h.sessionRepo.RevokeAllExceptForActor(r.Context(),
 		caller.ActorID, string(caller.ActorType), h.tenantID, currentSessionID)
 	if rerr != nil {
 		Error(w, http.StatusInternalServerError, "could not revoke sessions")
 		return
 	}
 	h.recordAudit(r.Context(), "auth.sessions_revoked_all_except_current",
 		caller.ActorID, caller.ActorType, currentSessionID,
 		map[string]interface{}{
 			"count":              count,
 			"current_session_id": currentSessionID,
 		})
 	writeJSON(w, http.StatusOK, map[string]interface{}{"revoked_count": count})
 }
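
The "empty current session ID revokes everything" semantic of RevokeAllExceptCurrent can be sketched with an in-memory slice. The session type and revokeAllExcept below are hypothetical illustrations; the real work happens in sessionRepo.RevokeAllExceptForActor, whose implementation is not part of this diff.

```go
package main

import "fmt"

// session is a trimmed, hypothetical stand-in for sessiondomain.Session.
type session struct {
	ID      string
	ActorID string
	Revoked bool
}

// revokeAllExcept marks every active session of actorID as revoked
// except the one whose ID equals keepID, and returns how many it
// revoked. An empty keepID matches no session, so all are revoked.
func revokeAllExcept(sessions []*session, actorID, keepID string) int {
	count := 0
	for _, s := range sessions {
		if s.ActorID != actorID || s.Revoked || s.ID == keepID {
			continue
		}
		s.Revoked = true
		count++
	}
	return count
}

func main() {
	sessions := []*session{
		{ID: "s1", ActorID: "u-1"},
		{ID: "s2", ActorID: "u-1"},
		{ID: "s3", ActorID: "u-2"},
	}
	// Cookie-mode caller on s1: only s2 is revoked.
	fmt.Println(revokeAllExcept(sessions, "u-1", "s1")) // prints 1
	// Bearer-mode caller (no current session): s1 is revoked too.
	fmt.Println(revokeAllExcept(sessions, "u-1", "")) // prints 1
}
```

Returning the count lets the caller write one summary audit row per invocation, as the handler does.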
@@ -255,6 +255,14 @@ func (s *stubUserRepo) ListAll(_ context.Context, _ string) ([]*userdomain.User,
 	return nil, nil
 }
 // ListDeactivatedBefore satisfies the Sprint 6 COMP-002-RETENTION
 // interface addition. The phase-5 OIDC handler tests don't exercise
 // retention paths, so an empty result keeps the contract without
 // changing test semantics.
 func (s *stubUserRepo) ListDeactivatedBefore(_ context.Context, _ time.Time) ([]*userdomain.User, error) {
 	return nil, nil
 }
 type phase5StubAudit struct {
 	events []string
 	// Audit 2026-05-11 Fix 13 — capture the details map so the
@@ -83,6 +83,20 @@ func (s *stubFullUserRepo) ListAll(_ context.Context, tenantID string) ([]*userd
 	return out, nil
 }
 // ListDeactivatedBefore satisfies the Sprint 6 COMP-002-RETENTION
 // interface addition. Walk rows, filter by DeactivatedAt-before-threshold.
 // Order is intentionally not stabilised — the auth_users handler tests
 // don't exercise the retention loop.
 func (s *stubFullUserRepo) ListDeactivatedBefore(_ context.Context, threshold time.Time) ([]*userdomain.User, error) {
 	var out []*userdomain.User
 	for _, u := range s.rows {
 		if u.DeactivatedAt != nil && u.DeactivatedAt.Before(threshold) {
 			out = append(out, u)
 		}
 	}
 	return out, nil
 }
 // stubRevoker records cascade-revoke calls.
 type stubRevoker struct {
 	called    bool
@@ -52,7 +52,7 @@ type CertificateService interface {
 // CertificateHandler handles HTTP requests for certificate operations.
 type CertificateHandler struct {
 	svc         CertificateService
-	ocspLimiter *ratelimit.SlidingWindowLimiter // production hardening II Phase 3 — per-source-IP cap on OCSP
+	ocspLimiter ratelimit.Limiter // production hardening II Phase 3 — per-source-IP cap on OCSP
 }
 // NewCertificateHandler creates a new CertificateHandler with a service dependency.
@@ -65,7 +65,7 @@ func NewCertificateHandler(svc CertificateService) CertificateHandler {
 // cmd/server/main.go): 1000 req/min/IP. Setting to nil disables the
 // limit; the limiter's own NewSlidingWindowLimiter(maxN<=0, ...)
 // also produces a no-op limiter, so the env-var-zero case is safe.
-func (h *CertificateHandler) SetOCSPRateLimiter(l *ratelimit.SlidingWindowLimiter) {
+func (h *CertificateHandler) SetOCSPRateLimiter(l ratelimit.Limiter) {
 	h.ocspLimiter = l
 }
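
Switching the setter parameter from a concrete pointer to an interface changes what "setting to nil" means. A nil *SlidingWindowLimiter stored in an interface value is not equal to nil, so a guard of the form `h.ocspLimiter != nil` would treat it as enabled. The sketch below shows the effect with hypothetical types: Limiter here is a one-method stand-in (the real ratelimit.Limiter method set is not shown in this diff), and asLimiter is an illustrative helper, not an existing function.

```go
package main

import "fmt"

// Limiter is a hypothetical one-method stand-in for ratelimit.Limiter.
type Limiter interface {
	Allow(key string) bool
}

// slidingWindowLimiter is a hypothetical concrete implementation.
type slidingWindowLimiter struct{}

func (l *slidingWindowLimiter) Allow(key string) bool { return true }

// handler mirrors the shape of a handler with an optional limiter.
type handler struct{ limiter Limiter }

func (h *handler) SetLimiter(l Limiter) { h.limiter = l }

// limited reports whether the nil-check guard would treat the
// limiter as configured.
func (h *handler) limited() bool { return h.limiter != nil }

// asLimiter converts a possibly-nil pointer into an interface value
// that compares equal to nil when the pointer is nil.
func asLimiter(l *slidingWindowLimiter) Limiter {
	if l == nil {
		return nil
	}
	return l
}

func main() {
	var disabled *slidingWindowLimiter // nil pointer, e.g. from config
	h := &handler{}

	h.SetLimiter(disabled)
	fmt.Println(h.limited()) // prints true: typed nil inside the interface

	h.SetLimiter(asLimiter(disabled))
	fmt.Println(h.limited()) // prints false: untyped nil interface
}
```

Call sites that pass the literal `nil` are unaffected; only those that forward a possibly-nil pointer variable need the conversion.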
@@ -100,13 +100,13 @@ type ESTHandler struct {
 	// EST RFC 7030 hardening Phase 3.3: per-handler source-IP rate
 	// limiter for FAILED HTTP Basic auth attempts. Keyed by sourceIP so
 	// a hostile network segment can't burn through the password.
-	failedBasicLimiter *ratelimit.SlidingWindowLimiter
+	failedBasicLimiter ratelimit.Limiter
 	// EST RFC 7030 hardening Phase 4.2: per-handler per-principal sliding-
 	// window rate limit. Keyed by (CSR-CN, sourceIP) so a stolen
 	// bootstrap cert AND a known device CN can't be used to flood the
 	// issuer. Disabled when nil; configured per-profile.
-	perPrincipalLimiter *ratelimit.SlidingWindowLimiter
+	perPrincipalLimiter ratelimit.Limiter
 	// labelForLog gives observability code a per-profile string to
 	// include in audit log lines / Prometheus labels. Defaults to
@@ -170,7 +170,7 @@ func (h *ESTHandler) SetEnrollmentPassword(pw string) { h.basicPassword = pw }
 // rate limiter. Phase 3.3. Disabled when nil — but Validate() at
 // startup refuses an enabled basic-auth profile without a configured
 // limiter, so a real deploy always wires one.
-func (h *ESTHandler) SetSourceIPRateLimiter(l *ratelimit.SlidingWindowLimiter) {
+func (h *ESTHandler) SetSourceIPRateLimiter(l ratelimit.Limiter) {
 	h.failedBasicLimiter = l
 }
@@ -179,7 +179,7 @@ func (h *ESTHandler) SetSourceIPRateLimiter(l *ratelimit.SlidingWindowLimiter) {
 // every successful enrollment, NOT just failures — the goal is to
 // bound enrollment-flooding from a compromised credential, not just
 // failed-auth brute force.
-func (h *ESTHandler) SetPerPrincipalRateLimiter(l *ratelimit.SlidingWindowLimiter) {
+func (h *ESTHandler) SetPerPrincipalRateLimiter(l ratelimit.Limiter) {
 	h.perPrincipalLimiter = l
 }
@@ -28,7 +28,7 @@ type ExportService interface {
 // ExportHandler handles HTTP requests for certificate export operations.
 type ExportHandler struct {
 	svc           ExportService
-	exportLimiter *ratelimit.SlidingWindowLimiter // production hardening II Phase 3
+	exportLimiter ratelimit.Limiter // production hardening II Phase 3
 }
 // NewExportHandler creates a new ExportHandler with a service dependency.
@@ -40,7 +40,7 @@ func NewExportHandler(svc ExportService) ExportHandler {
 // Production hardening II Phase 3. Default cap (when set in
 // cmd/server/main.go): 50 exports/hr/operator. Setting to nil
 // disables the limit.
-func (h *ExportHandler) SetExportRateLimiter(l *ratelimit.SlidingWindowLimiter) {
+func (h *ExportHandler) SetExportRateLimiter(l ratelimit.Limiter) {
 	h.exportLimiter = l
 }
@@ -102,6 +102,20 @@ type ExpiryAlertSnapshotter interface {
 	SnapshotExpiryAlerts() []service.ExpiryAlertSnapshotEntry
 }
 // AuditChainCounterSnapshotter is the surface MetricsHandler consumes
 // to emit the Sprint 6 COMP-001-HASH tamper-evidence counters:
 //
 //	certctl_audit_chain_break_detected_total counter
 //	certctl_audit_chain_verify_total          counter
 //	certctl_audit_chain_rows                  gauge
 //	certctl_audit_chain_last_verified_at      gauge (unix seconds)
 //
 // *service.AuditChainCounter satisfies this. nil disables emission;
 // cmd/server/main.go wires the instance at startup.
 type AuditChainCounterSnapshotter interface {
 	Snapshot() service.AuditChainSnapshot
 }
 // MetricsHandler handles HTTP requests for metrics.
 // Supports both JSON format (GET /api/v1/metrics) and Prometheus exposition format
 // (GET /api/v1/metrics/prometheus) for integration with Prometheus, Grafana, Datadog, etc.
@@ -129,6 +143,10 @@ type MetricsHandler struct {
 	// 2026-05-03 Infisical deep-research deliverable. nil disables
 	// emission of certctl_expiry_alerts_total{channel,threshold,result}.
 	expiryAlerts ExpiryAlertSnapshotter
 	// Sprint 6 COMP-001-HASH tamper-evidence counters. nil disables
 	// emission of certctl_audit_chain_* metrics. *service.AuditChainCounter
 	// is the production wiring; cmd/server/main.go sets this at startup.
 	auditChainCounter AuditChainCounterSnapshotter
 }
 // NewMetricsHandler creates a new MetricsHandler with a service dependency.
@@ -177,6 +195,14 @@ func (h *MetricsHandler) SetExpiryAlerts(c ExpiryAlertSnapshotter) {
 	h.expiryAlerts = c
 }
 // SetAuditChainCounter wires the Sprint 6 COMP-001-HASH tamper-evidence
 // counters for the Prometheus exposition. nil disables the block.
 // The counter is also passed to scheduler.SetAuditChainBreakRecorder so
 // the verify loop writes to the same instance the handler reads.
 func (h *MetricsHandler) SetAuditChainCounter(c AuditChainCounterSnapshotter) {
 	h.auditChainCounter = c
 }
 // MetricsResponse represents the JSON metrics response for V2.
 type MetricsResponse struct {
 	Gauge   MetricsGauge   `json:"gauge"`
@@ -523,6 +549,29 @@ func (h MetricsHandler) GetPrometheusMetrics(w http.ResponseWriter, r *http.Requ
 			}
 		}
 	}
 	// Sprint 6 COMP-001-HASH tamper-evidence counters. Emitted as four
 	// adjacent series so an alert rule can fire on any non-zero
 	// certctl_audit_chain_break_detected_total (the operator-actionable
 	// signal — see docs/operator/audit-chain.md).
 	if h.auditChainCounter != nil {
 		snap := h.auditChainCounter.Snapshot()
 		fmt.Fprintf(w, "\n# HELP certctl_audit_chain_break_detected_total Number of audit_events hash-chain breaks detected (Sprint 6 COMP-001-HASH).\n")
 		fmt.Fprintf(w, "# TYPE certctl_audit_chain_break_detected_total counter\n")
 		fmt.Fprintf(w, "certctl_audit_chain_break_detected_total %d\n", snap.BreaksDetected)
 		fmt.Fprintf(w, "# HELP certctl_audit_chain_verify_total Number of audit_events_verify_chain() walks completed by the scheduler.\n")
 		fmt.Fprintf(w, "# TYPE certctl_audit_chain_verify_total counter\n")
 		fmt.Fprintf(w, "certctl_audit_chain_verify_total %d\n", snap.WalksCompleted)
 		fmt.Fprintf(w, "# HELP certctl_audit_chain_rows Most recent walk's row count (gauge — last-write-wins).\n")
 		fmt.Fprintf(w, "# TYPE certctl_audit_chain_rows gauge\n")
 		fmt.Fprintf(w, "certctl_audit_chain_rows %d\n", snap.LastRowCount)
 		fmt.Fprintf(w, "# HELP certctl_audit_chain_last_verified_at Unix seconds of most recent walk (0 = never).\n")
 		fmt.Fprintf(w, "# TYPE certctl_audit_chain_last_verified_at gauge\n")
 		fmt.Fprintf(w, "certctl_audit_chain_last_verified_at %d\n", snap.LastVerifiedAtUnix)
 	}
 }
 // formatLE formats a histogram bucket boundary the way Prometheus
@@ -170,6 +170,14 @@ func (r *intuneE2EAuditRepo) List(_ context.Context, _ *repository.AuditFilter)
 	return nil, nil
 }
 // VerifyHashChain satisfies the Sprint 6 COMP-001-HASH interface
 // addition. In-memory stub: always clean.
 func (r *intuneE2EAuditRepo) VerifyHashChain(_ context.Context) (string, int, int, error) {
 	r.mu.Lock()
 	defer r.mu.Unlock()
 	return "", -1, len(r.events), nil
 }
 func (r *intuneE2EAuditRepo) actions() []string {
 	r.mu.Lock()
 	defer r.mu.Unlock()
@@ -0,0 +1,291 @@
 // Copyright 2026 certctl LLC. All rights reserved.
 // SPDX-License-Identifier: BUSL-1.1
 package middleware
 import (
 	"bytes"
 	"crypto/sha256"
 	"encoding/hex"
 	"net/http"
 	"strings"
 )
 // Phase 6 SCALE-L2 closure (2026-05-14): ETag / If-None-Match
 // middleware for read-heavy list endpoints.
 //
 // Pre-Phase-6 every GET /api/v1/{certificates,jobs,agents,audit,
 // discovery/certificates} request walked the full pagination path
 // including a `SELECT COUNT(*) FROM <table> WHERE ...` query for
 // the metadata block. The dashboard's polling loop alone hits these
 // endpoints every 30s; on a 50K-cert fleet that's ~14K COUNT(*)
 // rows scanned per minute for a result the operator hasn't actually
 // changed.
 //
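 // The middleware does not skip that query: the handler still runs in
 // full. What it saves is the response transfer when nothing changed.
 // It sits in front of the handler and: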
 //
 //   1. Lets the handler run normally (writing JSON to a response
 //      buffer rather than the wire).
 //   2. Computes a SHA-256 ETag of the buffered response body. The
 //      ETag is deterministic over (body bytes), so when the
 //      underlying list contents are unchanged the ETag is the same
 //      regardless of which replica served the request.
 //   3. Compares the computed ETag against the request's
 //      `If-None-Match` header. Match → write 304 Not Modified with
 //      an empty body. No match → write the full response with the
 //      `ETag:` header set so the client can store it for the next
 //      request.
 //
 // Constraints / non-goals:
 //
 //   - GET / HEAD only. POST / PUT / DELETE bypass the middleware
 //     (ETags on mutations introduce cache-correctness bugs around
 //     the request body not matching the response body).
 //   - Non-2xx responses (4xx errors, 5xx) bypass the ETag
 //     computation. The handler's error responses go through
 //     unchanged.
 //   - Responses larger than maxETagBufferBytes (64 KiB) skip the
 //     hash. Buffering very large response bodies in-memory just to
 //     hash them would cost more than the cache win. The default
 //     covers the cursor-paginated 100-row default on every list
 //     endpoint; raising the page-size override could exceed the
 //     limit, in which case ETag silently degrades to "no caching"
 //     for those calls.
 //   - The hash is computed over the response body bytes, NOT over
 //     a (max-updated-at, row-count) tuple from the DB. This is the
 //     less-clever-but-more-correct choice: any response-shape
 //     change (a new field added by a handler refactor, locale
 //     formatting drift, ordering shuffles) produces a fresh ETag
 //     automatically without requiring per-endpoint metadata
 //     wiring. The cost is one SHA-256 pass over the response body
 //     per request, which is dwarfed by the JSON marshaling cost
 //     already in the path.
 const (
 	// maxETagBufferBytes caps how much response body the middleware
 	// will buffer for hashing. 64 KiB covers a 100-row cursor page
 	// at the default 500-bytes-per-row JSON shape on every list
 	// endpoint. Responses larger than this skip the ETag pass.
 	maxETagBufferBytes = 64 * 1024
 )
 // ETag returns middleware that emits a strong ETag header on
 // successful GET / HEAD responses and answers 304 Not Modified when
 // If-None-Match matches. Wrap each list endpoint's handler with it
 // (ETag takes an http.Handler, so a handler method needs the
 // http.HandlerFunc adapter):
 //
 //	mux.Handle("GET /api/v1/certificates", middleware.ETag(http.HandlerFunc(h.ListCertificates)))
 //
 // Or per router-registration if the router supports method-aware
 // wrapping; see internal/api/router/router.go for the wiring shape.
 func ETag(next http.Handler) http.Handler {
 	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 		// Only GET + HEAD benefit. POST/PUT/DELETE always run.
 		if r.Method != http.MethodGet && r.Method != http.MethodHead {
 			next.ServeHTTP(w, r)
 			return
 		}
 		// Buffer the handler's response. The handler still calls
 		// w.WriteHeader / w.Write normally; the recorder captures
 		// the bytes + status code for the post-handler ETag pass.
 		rec := &etagRecorder{
 			ResponseWriter: w,
 			body:           bytes.NewBuffer(nil),
 			status:         http.StatusOK,
 			headerWritten:  false,
 		}
 		next.ServeHTTP(rec, r)
 		// Only 2xx responses get an ETag. Anything else (a 3xx the
 		// handler produced itself, 4xx, 5xx) passes through
 		// unchanged because an error body shouldn't be cached
 		// against an ETag.
 		if rec.status < 200 || rec.status >= 300 {
 			rec.flush()
 			return
 		}
 		// Skip the ETag pass for over-sized responses. Once the
 		// buffer cap is exceeded the recorder streams the body
 		// straight to the wire, so there is nothing left to hash
 		// and flush() is a no-op on this path.
 		if rec.bodyTruncated {
 			rec.flush()
 			return
 		}
 		// Compute the ETag over the buffered body.
 		bodyBytes := rec.body.Bytes()
 		sum := sha256.Sum256(bodyBytes)
 		etag := `"` + hex.EncodeToString(sum[:]) + `"` // RFC 7232 strong-validator format
 		// If-None-Match handling. The header can be a
 		// comma-separated list; check each candidate against the
 		// computed ETag.
 		if matchETag(r.Header.Get("If-None-Match"), etag) {
 			// 304 Not Modified: preserve the ETag header but
 			// emit no body. Drop Content-Length and Content-Type
 			// to avoid the "declared length doesn't match body"
 			// mismatch some proxies are strict about.
 			h := w.Header()
 			h.Set("ETag", etag)
 			h.Del("Content-Length")
 			h.Del("Content-Type")
 			w.WriteHeader(http.StatusNotModified)
 			return
 		}
 		// Cache miss / first request. Emit the full response with
 		// ETag header for the next request to use.
 		w.Header().Set("ETag", etag)
 		rec.flush()
 	})
 }
 // matchETag returns true when ifNoneMatch (an If-None-Match header
 // value) is the wildcard `*` or contains an entry that matches etag
 // (the computed validator). RFC 7232 §3.2 says:
 //
 //	If-None-Match = "*" / 1#entity-tag
 //
 // The same section requires the weak comparison function for
 // If-None-Match, so a `W/` prefix on a candidate is ignored before
 // the opaque tags are compared. We only ever emit strong validators,
 // but a client may still send back a weakened copy (some reverse
 // proxies weaken ETags when they compress a response).
 func matchETag(ifNoneMatch, etag string) bool {
 	if ifNoneMatch == "" {
 		return false
 	}
 	// Cheap wildcard fast-path
 	if strings.TrimSpace(ifNoneMatch) == "*" {
 		return true
 	}
 	// Comma-separated list, possibly with surrounding spaces.
 	for _, candidate := range strings.Split(ifNoneMatch, ",") {
 		candidate = strings.TrimPrefix(strings.TrimSpace(candidate), "W/")
 		if candidate == etag {
 			return true
 		}
 	}
 	return false
 }
 // etagRecorder buffers response bytes + status so the post-handler
 // ETag pass can hash the body. WriteHeader and Write follow the
 // http.ResponseWriter contract; the recorder ONLY differs by
 // holding the bytes until flush() is called.
 type etagRecorder struct {
 	http.ResponseWriter
 	body                *bytes.Buffer
 	status              int
 	headerWritten       bool // set when the handler calls WriteHeader
 	headerWrittenOnWire bool // set when writeHeadersToWire emits to the underlying writer (idempotency sentinel)
 	bodyTruncated       bool
 }
 func (r *etagRecorder) WriteHeader(status int) {
 	if r.headerWritten {
 		// Honor the http stdlib's contract: subsequent
 		// WriteHeader calls are ignored after the first.
 		return
 	}
 	r.status = status
 	r.headerWritten = true
 }
 func (r *etagRecorder) Write(b []byte) (int, error) {
 	if r.bodyTruncated {
 		// The cap was already exceeded: headers and the buffered
 		// prefix are on the wire, and flush() is a no-op on this
 		// path, so later chunks must go straight to the underlying
 		// writer or they would be lost.
 		return r.ResponseWriter.Write(b)
 	}
 	// Track whether THIS write would push us over the cap. If
 	// yes, stop buffering — the body is too big to ETag.
 	if r.body.Len()+len(b) > maxETagBufferBytes {
 		r.bodyTruncated = true
 		// Flush the buffered prefix + this chunk straight to the
 		// wire; preserve the handler's bytes-written count.
 		// Headers haven't been written yet (we hold them until
 		// flush); write them now.
 		r.writeHeadersToWire()
 		if r.body.Len() > 0 {
 			if _, err := r.ResponseWriter.Write(r.body.Bytes()); err != nil {
 				return 0, err
 			}
 			r.body.Reset()
 		}
 		return r.ResponseWriter.Write(b)
 	}
 	return r.body.Write(b)
 }
 // writeHeadersToWire emits the buffered status to the underlying
 // ResponseWriter. Idempotent: subsequent calls are no-ops.
 func (r *etagRecorder) writeHeadersToWire() {
 	// headerWrittenOnWire is the idempotency sentinel. net/http logs
 	// a "superfluous WriteHeader call" warning on a second
 	// WriteHeader, so return early once the status is on the wire.
 	if r.headerWrittenOnWire {
 		return
 	}
 	if !r.headerWritten {
 		// Handler never called WriteHeader explicitly; the
 		// http.ResponseWriter contract says that's an implicit
 		// 200 OK on the first Write.
 		r.status = http.StatusOK
 		r.headerWritten = true
 	}
 	// Hotfix #12 (CodeQL alert #34 — go/reflected-xss): defense-in-
 	// depth Content-Type guard. This middleware is wired ONLY to JSON
 	// list endpoints (GET /api/v1/{certificates,agents,jobs,audit,
 	// discovered-certificates} — see internal/api/router/router.go).
 	// Every wrapped handler currently sets Content-Type:
 	// application/json via handler.JSON() before the first Write. But
 	// the recorder is a generic byte forwarder; CodeQL's data-flow
 	// query sees `r.ResponseWriter.Write(b)` at the sink and can't
 	// see that the wrapped handler set a non-HTML Content-Type — so
 	// it flags reflected-XSS even though browsers don't render
 	// application/json as HTML. The fix is to make the Content-Type
 	// guarantee explicit at the chokepoint: if the wrapped handler
 	// forgot to set Content-Type, default to application/json +
 	// charset=utf-8 here. Behavior-preserving for the 5 current
 	// handlers (they all set Content-Type) and a safe guard against
 	// a future handler bug that would otherwise let the browser
 	// content-sniff a JSON body as text/html.
 	//
 	// Drop the embedded-field selector for Header() — etagRecorder
 	// doesn't override Header(), so r.Header() resolves to the
 	// embedded ResponseWriter.Header() (staticcheck QF1008). The
 	// neighboring r.ResponseWriter.WriteHeader / r.ResponseWriter.Write
 	// calls intentionally KEEP the explicit selector because
 	// etagRecorder.Write / etagRecorder.WriteHeader override them
 	// and the embedded form is required to bypass recursion.
 	hdr := r.Header()
 	if hdr.Get("Content-Type") == "" {
 		hdr.Set("Content-Type", "application/json; charset=utf-8")
 	}
 	r.ResponseWriter.WriteHeader(r.status)
 	r.headerWrittenOnWire = true
 }
 // flush emits the buffered status + body to the underlying
 // ResponseWriter. Called by the ETag middleware after the handler
 // returns AND the response is either a cache miss (no
 // If-None-Match match) or non-cacheable (4xx, oversized).
 func (r *etagRecorder) flush() {
 	if r.bodyTruncated {
 		// Headers + body already on the wire via Write's
 		// truncation path. Nothing to flush.
 		return
 	}
 	r.writeHeadersToWire()
 	if r.body.Len() > 0 {
 		_, _ = r.ResponseWriter.Write(r.body.Bytes())
 	}
 }
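To make the client side of the contract concrete, here is a minimal, self-contained sketch of the conditional-GET round trip a polling client would perform. It does not use the middleware itself: `bodyETag`, `staticList`, and `conditionalGet` are hypothetical helpers written for this illustration, and `staticList` only approximates the behaviour described above (exact-match comparison, fixed body).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// bodyETag mirrors the hashing step: a quoted hex SHA-256 of the body.
func bodyETag(body []byte) string {
	sum := sha256.Sum256(body)
	return `"` + hex.EncodeToString(sum[:]) + `"`
}

// staticList is a toy stand-in for a wrapped list endpoint. It always
// serves the same body and answers 304 on an exact If-None-Match hit.
func staticList(body []byte) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tag := bodyETag(body)
		w.Header().Set("ETag", tag)
		if r.Header.Get("If-None-Match") == tag {
			w.WriteHeader(http.StatusNotModified)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		_, _ = w.Write(body)
	})
}

// conditionalGet issues one GET, optionally with If-None-Match, and
// returns the status code plus the ETag the server sent back.
func conditionalGet(h http.Handler, ifNoneMatch string) (int, string) {
	req := httptest.NewRequest(http.MethodGet, "/list", nil)
	if ifNoneMatch != "" {
		req.Header.Set("If-None-Match", ifNoneMatch)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Header().Get("ETag")
}

func main() {
	h := staticList([]byte(`{"items":[],"total":0}`))

	status, tag := conditionalGet(h, "") // first poll: no validator yet
	fmt.Println(status, tag != "")       // 200 true

	status, _ = conditionalGet(h, tag) // second poll: replay the stored ETag
	fmt.Println(status)                // 304

	status, _ = conditionalGet(h, `"stale"`) // validator no longer matches
	fmt.Println(status)                      // 200
}
```

A client that stores the `ETag` from each 200 and replays it on the next poll pays for the full body only when the list actually changed.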
@@ -0,0 +1,259 @@
 // Copyright 2026 certctl LLC. All rights reserved.
 // SPDX-License-Identifier: BUSL-1.1
 package middleware
 import (
 	"io"
 	"net/http"
 	"net/http/httptest"
 	"strings"
 	"testing"
 )
 // Phase 6 SCALE-L2 contract pin (2026-05-14): the ETag middleware
 // must:
 //   1. Emit an ETag header on successful GET / HEAD responses.
 //   2. Return 304 Not Modified when the client's If-None-Match
 //      matches the computed ETag (cache hit).
 //   3. Return 200 + new ETag when the body has changed (cache miss
 //      after mutation).
 //   4. NOT apply to POST / PUT / DELETE.
 //   5. NOT apply to non-2xx responses (errors pass through unchanged).
 //   6. Skip ETag for over-sized responses (degrade gracefully, not
 //      crash).
 func TestETag_GET_EmitsETagHeader(t *testing.T) {
 	handler := ETag(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 		w.Header().Set("Content-Type", "application/json")
 		_, _ = w.Write([]byte(`{"items":[{"id":"cert-1"}],"total":1}`))
 	}))
 	req := httptest.NewRequest(http.MethodGet, "/api/v1/certificates", nil)
 	rec := httptest.NewRecorder()
 	handler.ServeHTTP(rec, req)
 	if rec.Code != http.StatusOK {
 		t.Errorf("status = %d; want 200", rec.Code)
 	}
 	if etag := rec.Header().Get("ETag"); etag == "" {
 		t.Errorf("ETag header is empty; want non-empty strong validator")
 	}
 	if !strings.Contains(rec.Body.String(), "cert-1") {
 		t.Errorf("body missing handler output: %q", rec.Body.String())
 	}
 }
 func TestETag_RepeatedRequest_Returns304(t *testing.T) {
 	body := []byte(`{"items":[{"id":"cert-1"}],"total":1}`)
 	handler := ETag(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 		w.Header().Set("Content-Type", "application/json")
 		_, _ = w.Write(body)
 	}))
 	// First request — establish the cache.
 	req1 := httptest.NewRequest(http.MethodGet, "/api/v1/certificates", nil)
 	rec1 := httptest.NewRecorder()
 	handler.ServeHTTP(rec1, req1)
 	etag := rec1.Header().Get("ETag")
 	if etag == "" {
 		t.Fatal("first response missing ETag — cannot run cache-hit test")
 	}
 	// Second request with If-None-Match — should 304.
 	req2 := httptest.NewRequest(http.MethodGet, "/api/v1/certificates", nil)
 	req2.Header.Set("If-None-Match", etag)
 	rec2 := httptest.NewRecorder()
 	handler.ServeHTTP(rec2, req2)
 	if rec2.Code != http.StatusNotModified {
 		t.Errorf("status = %d; want 304 Not Modified (cache hit)", rec2.Code)
 	}
 	if rec2.Body.Len() != 0 {
 		t.Errorf("304 response body non-empty: %q (RFC 7232 §4.1: 304 MUST NOT have a body)", rec2.Body.String())
 	}
 	if rec2.Header().Get("ETag") != etag {
 		t.Errorf("304 response ETag = %q; want %q (must be preserved for next request)", rec2.Header().Get("ETag"), etag)
 	}
 }
 func TestETag_AfterMutation_Returns200WithNewETag(t *testing.T) {
 	// Simulate a mutation: the handler's response body changes
 	// between request 1 and request 2. Request 2 (with the now-stale
 	// If-None-Match) must miss and return 200 + the new ETag.
 	currentBody := []byte(`{"items":[{"id":"cert-1"}],"total":1}`)
 	handler := ETag(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 		w.Header().Set("Content-Type", "application/json")
 		_, _ = w.Write(currentBody)
 	}))
 	// Initial request — capture ETag.
 	req1 := httptest.NewRequest(http.MethodGet, "/api/v1/certificates", nil)
 	rec1 := httptest.NewRecorder()
 	handler.ServeHTTP(rec1, req1)
 	etag1 := rec1.Header().Get("ETag")
 	// Simulate a mutation by changing the response body.
 	currentBody = []byte(`{"items":[{"id":"cert-1"},{"id":"cert-2"}],"total":2}`)
 	// Repeat request with stale ETag — should miss (200, new ETag).
 	req2 := httptest.NewRequest(http.MethodGet, "/api/v1/certificates", nil)
 	req2.Header.Set("If-None-Match", etag1)
 	rec2 := httptest.NewRecorder()
 	handler.ServeHTTP(rec2, req2)
 	if rec2.Code != http.StatusOK {
 		t.Errorf("status = %d; want 200 (cache miss after mutation)", rec2.Code)
 	}
 	etag2 := rec2.Header().Get("ETag")
 	if etag2 == etag1 {
 		t.Errorf("ETag unchanged after body mutation: %q = %q", etag1, etag2)
 	}
 	if !strings.Contains(rec2.Body.String(), "cert-2") {
 		t.Errorf("post-mutation body missing new content: %q", rec2.Body.String())
 	}
 }
 func TestETag_POST_BypassesMiddleware(t *testing.T) {
 	handler := ETag(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 		w.WriteHeader(http.StatusCreated)
 		_, _ = w.Write([]byte(`{"id":"cert-new"}`))
 	}))
 	req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates", strings.NewReader(`{}`))
 	rec := httptest.NewRecorder()
 	handler.ServeHTTP(rec, req)
 	if rec.Code != http.StatusCreated {
 		t.Errorf("status = %d; want 201", rec.Code)
 	}
 	if etag := rec.Header().Get("ETag"); etag != "" {
 		t.Errorf("ETag header set on POST response: %q (POST/PUT/DELETE must not have ETag)", etag)
 	}
 }
 func TestETag_5xx_PassesThroughWithoutETag(t *testing.T) {
 	handler := ETag(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 		w.WriteHeader(http.StatusInternalServerError)
 		_, _ = w.Write([]byte(`{"error":"boom"}`))
 	}))
 	req := httptest.NewRequest(http.MethodGet, "/api/v1/certificates", nil)
 	rec := httptest.NewRecorder()
 	handler.ServeHTTP(rec, req)
 	if rec.Code != http.StatusInternalServerError {
 		t.Errorf("status = %d; want 500", rec.Code)
 	}
 	if etag := rec.Header().Get("ETag"); etag != "" {
 		t.Errorf("ETag set on 500 response: %q (non-2xx must not be cached)", etag)
 	}
 	if !strings.Contains(rec.Body.String(), "boom") {
 		t.Errorf("error body lost: %q", rec.Body.String())
 	}
 }
 func TestETag_4xx_PassesThroughWithoutETag(t *testing.T) {
 	handler := ETag(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 		w.WriteHeader(http.StatusBadRequest)
 		_, _ = w.Write([]byte(`{"error":"invalid query"}`))
 	}))
 	req := httptest.NewRequest(http.MethodGet, "/api/v1/certificates?bad=true", nil)
 	rec := httptest.NewRecorder()
 	handler.ServeHTTP(rec, req)
 	if rec.Code != http.StatusBadRequest {
 		t.Errorf("status = %d; want 400", rec.Code)
 	}
 	if etag := rec.Header().Get("ETag"); etag != "" {
 		t.Errorf("ETag set on 400 response: %q (non-2xx must not be cached)", etag)
 	}
 }
 func TestETag_OversizedResponse_DegradesGracefully(t *testing.T) {
 	// Response larger than maxETagBufferBytes (64 KiB) must not
 	// be ETag'd, but the response itself must reach the client
 	// intact.
 	bigBody := make([]byte, maxETagBufferBytes+1024)
 	for i := range bigBody {
 		bigBody[i] = 'x'
 	}
 	handler := ETag(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 		w.Header().Set("Content-Type", "text/plain")
 		_, _ = w.Write(bigBody)
 	}))
 	req := httptest.NewRequest(http.MethodGet, "/api/v1/audit?limit=10000", nil)
 	rec := httptest.NewRecorder()
 	handler.ServeHTTP(rec, req)
 	if rec.Code != http.StatusOK {
 		t.Errorf("status = %d; want 200 (oversize body should not 5xx)", rec.Code)
 	}
 	if etag := rec.Header().Get("ETag"); etag != "" {
 		t.Errorf("ETag emitted for oversize response: %q (should degrade silently)", etag)
 	}
 	if got, want := rec.Body.Len(), len(bigBody); got != want {
 		t.Errorf("body bytes received = %d; want %d (oversize body should not be truncated on the wire)", got, want)
 	}
 }
 func TestETag_Wildcard_MatchesAny(t *testing.T) {
 	// RFC 7232 §3.2: If-None-Match: * matches any current
 	// representation. Clients use this for "give me 304 if anything
 	// exists" semantics.
 	handler := ETag(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 		_, _ = w.Write([]byte(`{"any":"thing"}`))
 	}))
 	req := httptest.NewRequest(http.MethodGet, "/api/v1/certificates", nil)
 	req.Header.Set("If-None-Match", "*")
 	rec := httptest.NewRecorder()
 	handler.ServeHTTP(rec, req)
 	if rec.Code != http.StatusNotModified {
 		t.Errorf("status = %d; want 304 (If-None-Match: * always matches)", rec.Code)
 	}
 }
 func TestETag_HEAD_TreatedLikeGET(t *testing.T) {
 	body := []byte(`{"items":[],"total":0}`)
 	handler := ETag(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 		// A real HEAD handler wouldn't actually write a body but
 		// the middleware shouldn't care — the ETag derives from
 		// whatever the handler emits.
 		_, _ = w.Write(body)
 	}))
 	req := httptest.NewRequest(http.MethodHead, "/api/v1/certificates", nil)
 	rec := httptest.NewRecorder()
 	handler.ServeHTTP(rec, req)
 	if rec.Code != http.StatusOK {
 		t.Errorf("status = %d; want 200", rec.Code)
 	}
 	if etag := rec.Header().Get("ETag"); etag == "" {
 		t.Errorf("HEAD response missing ETag (HEAD should be treated like GET)")
 	}
 }
 // TestETag_PassThrough_PreservesBody is a paranoia check that the
 // recorder doesn't drop bytes vs the underlying ResponseWriter. Reads
 // back the body and asserts byte-equality with what the handler wrote.
 func TestETag_PassThrough_PreservesBody(t *testing.T) {
 	body := []byte(`{"a":1,"b":2,"c":3}`)
 	handler := ETag(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 		_, _ = w.Write(body)
 	}))
 	req := httptest.NewRequest(http.MethodGet, "/api/v1/jobs", nil)
 	rec := httptest.NewRecorder()
 	handler.ServeHTTP(rec, req)
 	got, _ := io.ReadAll(rec.Body)
 	if string(got) != string(body) {
 		t.Errorf("body bytes mismatched: got %q, want %q", string(got), string(body))
 	}
 }
@@ -11,6 +11,7 @@ import (
 	"net/http"
 	"strings"
 	"sync"
 	"sync/atomic"
 	"time"
 	"github.com/google/uuid"
@@ -152,6 +153,14 @@ type RateLimitConfig struct {
 	// PerUserBurstSize overrides BurstSize for authenticated callers.
 	// Zero means "use BurstSize".
 	PerUserBurstSize int
 	// BucketTTL bounds the lifetime of an unused token bucket in the
 	// per-key map. The background sweeper runs every (BucketTTL/4) and
 	// removes entries whose last allow() call is older than BucketTTL.
 	// Zero or negative values fall through to a 1-hour default; values
 	// below 1 minute are clamped up to 1 minute (sweeper sanity).
 	// SEC-006 closure (Sprint 2, 2026-05-16).
 	BucketTTL time.Duration
 }
 // NewRateLimiter creates a per-key token bucket rate limiting middleware.
@@ -166,11 +175,18 @@ type RateLimitConfig struct {
 //   - Unauthenticated: "ip:" + r.RemoteAddr's host portion
 //
 // The bucket map is sync.RWMutex-guarded; create-on-demand for new keys.
-// There is no eviction; for a long-running server with millions of unique
-// IPs this can leak memory. A future enhancement is per-key TTL via a
-// lazy sweeper. For now the leak is bounded by realistic operator IP
-// fan-out and is acceptable per OWASP ASVS L2 (the threat model is abuse
-// by a known set of clients, not infinite-cardinality scanners).
+//
+// SEC-006 closure (Sprint 2, 2026-05-16). Pre-fix the bucket map had no
+// eviction, so high-cardinality unauthenticated traffic (CGNAT churn,
+// Tor exit lists, botnets, infinite-cardinality scanners) grew process
+// memory unboundedly. Each bucket now carries `lastAccess`; a background
 // sweeper goroutine (one per limiter) wakes every (bucketTTL / 4) and
 // removes entries whose lastAccess is older than `bucketTTL`. Default
 // TTL is 1 hour — well above realistic operator IP churn windows so a
 // returning client gets the same bucket, but bounded enough that a
 // scanner's churn is reclaimed within an hour. Operators can override
 // via cfg.BucketTTL (or the CERTCTL_RATE_LIMIT_BUCKET_TTL env var that
 // cmd/server/main.go threads through).
 func NewRateLimiter(cfg RateLimitConfig) func(http.Handler) http.Handler {
 	// Default per-user budgets to the IP-keyed budget when not overridden.
 	perUserRPS := cfg.PerUserRPS
@@ -182,14 +198,33 @@ func NewRateLimiter(cfg RateLimitConfig) func(http.Handler) http.Handler {
 		perUserBurst = float64(cfg.BurstSize)
 	}
 	// SEC-006: bucket TTL eviction. Default 1h; minimum 1m to keep
 	// the sweeper from running pathologically often if an operator
 	// sets a tiny value.
 	bucketTTL := cfg.BucketTTL
 	if bucketTTL <= 0 {
 		bucketTTL = time.Hour
 	}
 	if bucketTTL < time.Minute {
 		bucketTTL = time.Minute
 	}
 	limiter := &keyedRateLimiter{
 		ipRate:    cfg.RPS,
 		ipBurst:   float64(cfg.BurstSize),
 		userRate:  perUserRPS,
 		userBurst: perUserBurst,
 		buckets:   make(map[string]*tokenBucket),
 		bucketTTL: bucketTTL,
 	}
 	// Sweeper goroutine. Single goroutine per limiter; production wires
 	// 2 limiters (default + no-auth-fallback) so the cost is 2 idle
 	// goroutines per server. Lives for the process lifetime; no
 	// shutdown handle is exposed because main.go owns both limiters
 	// for the entire run.
 	go limiter.sweepLoop()
 	return func(next http.Handler) http.Handler {
 		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
 			key, isUser := rateLimitKey(r)
@@ -231,6 +266,12 @@ func rateLimitKey(r *http.Request) (string, bool) {
 // keyedRateLimiter holds a token bucket per (user-or-ip) key with separate
 // rate / burst defaults for the user-keyed and ip-keyed dimensions.
 //
 // SEC-006: bucketTTL bounds the unused-bucket lifetime; sweepLoop runs
 // in a goroutine spawned by NewRateLimiter and evicts entries whose
 // lastAccess is older than bucketTTL on every (bucketTTL/4) tick.
 // evictedTotal exposes the lifetime eviction count (atomic-loaded by
 // tests and the operator stats endpoint).
 type keyedRateLimiter struct {
 	mu        sync.RWMutex
 	buckets   map[string]*tokenBucket
@@ -238,6 +279,14 @@ type keyedRateLimiter struct {
 	ipBurst   float64
 	userRate  float64
 	userBurst float64
 	bucketTTL    time.Duration
 	evictedTotal atomic.Uint64
 	// sweepTickCh is the channel sweepLoop ticks on. When nil (the
 	// production case) sweepLoop creates its own time.Ticker; tests
 	// set a manual chan time.Time for deterministic eviction.
 	// Production never assigns this field.
 	sweepTickCh <-chan time.Time
 }
 func (k *keyedRateLimiter) allow(key string, isUser bool) bool {
@@ -260,22 +309,90 @@ func (k *keyedRateLimiter) allow(key string, isUser bool) bool {
 				burstSize:  burst,
 				tokens:     burst,
 				lastRefill: time.Now(),
 				lastAccess: time.Now(),
 			}
 			k.buckets[key] = tb
 		}
 		k.mu.Unlock()
 	}
-	return tb.allow()
+	allowed := tb.allow()
 	// SEC-006: update lastAccess on every call. touch() re-acquires
 	// the bucket's own mutex after tb.allow() has released it, which
 	// is cheap. Sweeper reads this to decide eviction.
 	tb.touch()
 	return allowed
 }
 // sweepLoop is the background eviction goroutine spawned by
 // NewRateLimiter. It wakes every bucketTTL/4 and removes any bucket
 // whose lastAccess is older than bucketTTL. The (bucketTTL/4) cadence
 // is a compromise — fast enough to keep the map ceiling tight,
 // slow enough that the sweep cost amortises across many requests.
 // SEC-006 closure.
 func (k *keyedRateLimiter) sweepLoop() {
 	// Test seam: if a manual tick channel is wired, use it. Production
 	// always uses time.NewTicker, whose C field is likewise a
 	// <-chan time.Time.
 	if k.sweepTickCh != nil {
 		for range k.sweepTickCh {
 			k.sweep()
 		}
 		return
 	}
 	period := k.bucketTTL / 4
 	if period < time.Second {
 		period = time.Second
 	}
 	t := time.NewTicker(period)
 	defer t.Stop()
 	for range t.C {
 		k.sweep()
 	}
 }
 // sweep removes every bucket whose lastAccess is older than bucketTTL
 // and bumps evictedTotal. Unexported; same-package tests call it
 // directly.
 func (k *keyedRateLimiter) sweep() {
 	cutoff := time.Now().Add(-k.bucketTTL)
 	k.mu.Lock()
 	defer k.mu.Unlock()
 	for key, tb := range k.buckets {
 		if tb.lastAccessTime().Before(cutoff) {
 			delete(k.buckets, key)
 			k.evictedTotal.Add(1)
 		}
 	}
 }
 // tokenBucket implements a simple thread-safe token bucket rate limiter.
 // This avoids importing golang.org/x/time/rate to keep dependencies minimal.
 //
 // SEC-006: lastAccess is updated on every allow() call (via touch()) so
 // the keyedRateLimiter sweeper can evict idle buckets without a second
 // per-key map. Guarded by the same mu as rate-limiting state.
 type tokenBucket struct {
 	mu         sync.Mutex
 	rate       float64   // tokens per second
 	burstSize  float64   // max tokens
 	tokens     float64   // current tokens
 	lastRefill time.Time // last refill time
 	lastAccess time.Time // last allow() call — for SEC-006 sweeper
 }
 // touch updates the bucket's lastAccess timestamp under its own mutex.
 // Called from keyedRateLimiter.allow after the rate-limit decision.
 func (tb *tokenBucket) touch() {
 	tb.mu.Lock()
 	tb.lastAccess = time.Now()
 	tb.mu.Unlock()
 }
 // lastAccessTime is the sweeper's read accessor. Uses the bucket's
 // own mutex so the read is consistent with concurrent touch() calls.
 func (tb *tokenBucket) lastAccessTime() time.Time {
 	tb.mu.Lock()
 	defer tb.mu.Unlock()
 	return tb.lastAccess
 }
 func (tb *tokenBucket) allow() bool {
@@ -2,9 +2,11 @@ package middleware
 import (
 	"context"
 	"fmt"
 	"net/http"
 	"net/http/httptest"
 	"testing"
 	"time"
 	"github.com/certctl-io/certctl/internal/auth"
 )
@@ -188,3 +190,94 @@ func TestRateLimiter_M025_EmptyUserKeyTreatedAsAnonymous(t *testing.T) {
 		t.Errorf("second anonymous request from different IP should still pass (independent IP buckets); got %d", rr.Code)
 	}
 }
 // =============================================================================
 // SEC-006 closure (Sprint 2, 2026-05-16). The token-bucket map now has
 // a background sweeper that evicts buckets whose last allow() call is
 // older than the configured BucketTTL. This test pins the eviction
 // path against a synthetic 1000-key load and asserts:
 //
 //   1. Buckets created by N distinct keys land in the map.
 //   2. After the simulated TTL elapses and the sweeper runs, the map
 //      is reclaimed and evictedTotal reflects the count.
 //   3. A subsequent request from a fresh key creates a new bucket
 //      (i.e. the map isn't poisoned by the eviction).
 //
 // The test calls sweep() directly rather than relying on the goroutine
 // + time.Ticker so it stays deterministic and fast. The sweeper
 // goroutine itself is exercised in production; this test pins the
 // eviction predicate.
 // =============================================================================
 func TestKeyedRateLimiter_SweepEvictsIdleBuckets(t *testing.T) {
 	limiter := &keyedRateLimiter{
 		ipRate:    1000,
 		ipBurst:   1000,
 		userRate:  1000,
 		userBurst: 1000,
 		buckets:   make(map[string]*tokenBucket),
 		bucketTTL: 100 * time.Millisecond,
 	}
 	// Populate 1000 buckets from a synthetic IP-key churn.
 	for i := 0; i < 1000; i++ {
 		key := fmt.Sprintf("ip:198.51.100.%d/%d", i%256, i)
 		if !limiter.allow(key, false) {
 			t.Fatalf("synthetic IP-key %d: allow returned false on first call", i)
 		}
 	}
 	limiter.mu.RLock()
 	if got := len(limiter.buckets); got != 1000 {
 		limiter.mu.RUnlock()
 		t.Fatalf("post-populate bucket count = %d; want 1000", got)
 	}
 	limiter.mu.RUnlock()
 	// Advance past the TTL boundary, then sweep.
 	time.Sleep(110 * time.Millisecond)
 	limiter.sweep()
 	limiter.mu.RLock()
 	remaining := len(limiter.buckets)
 	limiter.mu.RUnlock()
 	if remaining != 0 {
 		t.Errorf("post-sweep bucket count = %d; want 0 (all should have been evicted)", remaining)
 	}
 	if got := limiter.evictedTotal.Load(); got != 1000 {
 		t.Errorf("evictedTotal = %d; want 1000", got)
 	}
 	// A fresh request creates a new bucket — map isn't poisoned.
 	if !limiter.allow("ip:203.0.113.7", false) {
 		t.Errorf("fresh key: allow returned false on first call after sweep")
 	}
 	limiter.mu.RLock()
 	defer limiter.mu.RUnlock()
 	if got := len(limiter.buckets); got != 1 {
 		t.Errorf("post-sweep-plus-one bucket count = %d; want 1", got)
 	}
 }
 // TestKeyedRateLimiter_SweepKeepsActiveBuckets pins the inverse — a
 // bucket touched within the TTL window survives the sweep. Catches a
 // future regression that inverts the cutoff comparison.
 func TestKeyedRateLimiter_SweepKeepsActiveBuckets(t *testing.T) {
 	limiter := &keyedRateLimiter{
 		ipRate:    1000,
 		ipBurst:   1000,
 		userRate:  1000,
 		userBurst: 1000,
 		buckets:   make(map[string]*tokenBucket),
 		bucketTTL: 1 * time.Hour, // generous so test timing doesn't flake
 	}
 	limiter.allow("ip:198.51.100.42", false)
 	limiter.sweep()
 	limiter.mu.RLock()
 	defer limiter.mu.RUnlock()
 	if got := len(limiter.buckets); got != 1 {
 		t.Errorf("active-bucket count = %d; want 1 (sweep should not evict within TTL)", got)
 	}
 	if got := limiter.evictedTotal.Load(); got != 0 {
 		t.Errorf("evictedTotal = %d; want 0 (no evictions expected)", got)
 	}
 }
@@ -25,6 +25,7 @@ type SecurityHeadersConfig struct {
 	ContentTypeOptions    string // X-Content-Type-Options
 	ReferrerPolicy        string // Referrer-Policy
 	ContentSecurityPolicy string // Content-Security-Policy
 	PermissionsPolicy     string // Permissions-Policy (SEC-008 closure, Sprint 2 ACQ 2026-05-16)
 }
 // SecurityHeadersDefaults returns a recommended baseline.
@@ -32,9 +33,35 @@ type SecurityHeadersConfig struct {
 // CSP: default-src 'self' confines fetches to the same origin.
 // img-src 'self' data: allows inline base64 images (used by the
 // dashboard's certctl-logo and a few status icons).
-// style-src 'self' 'unsafe-inline' is required because Tailwind
+// style-src 'self' 'unsafe-inline' — the 'unsafe-inline' grant
-// (via Vite) injects per-component <style> blocks at build time;
+// is required by React's inline `style={...}` attribute model,
-// without 'unsafe-inline' the dashboard would render unstyled.
+// which emits HTML `style="..."` attributes that the browser
 // treats as inline styles for CSP purposes. The dashboard has 5
 // load-bearing dynamic-style sites: Tooltip's Floating-UI
 // position (left/top px values computed per-tick),
 // AgentFleetPage's dynamic color+width chart bars,
 // dashboard/charts.tsx Recharts color props, CertificatesPage's
 // progress-bar percent width, IssuerHierarchyPage's depth-based
 // marginLeft. The static-pixel uses (UsersPage filter + table UI,
 // DigestPage iframe min-height, AuthProvider demo-mode banner)
 // were migrated to Tailwind utility classes via FE-M6 closure
 // 2026-05-14.
 //
 // FE-M6 audit-framing correction: this comment USED TO say
 // "Tailwind (via Vite) injects per-component <style> blocks at
 // build time." That was factually wrong. Vite's CSS output is a
 // single .css file linked via <link rel="stylesheet"> — verified
 // against dist/index.html post-build: zero <style> tags emitted.
 // The 'unsafe-inline' grant exists for React's style-attribute
 // output path, not for Vite or Tailwind.
 //
 // Fully eliminating 'unsafe-inline' would require either banning
 // dynamic `style={...}` (rewriting the 5 load-bearing sites with
 // a CSS-in-JS library that emits hashed/nonce'd <style> blocks)
 // or adopting CSP nonces with React 18+'s style runtime. Neither
 // fits the original FE-M6 phase budget; tracked as a future
 // security-hardening item.
 //
 // 'unsafe-inline' is intentionally NOT in script-src — the
 // front-end ships as a bundled JS file, no inline scripts.
 //
@@ -52,6 +79,19 @@ type SecurityHeadersConfig struct {
 // Referrer-Policy: no-referrer-when-downgrade — preserves Referer
 // for same-origin navigation (useful for support/diagnostics) but
 // strips it on HTTPS→HTTP transitions.
 //
 // Permissions-Policy: deny-all-browser-features default. Acquisition-
 // audit SEC-008 closure (Sprint 2 ACQ, 2026-05-16). certctl is a
 // control-plane API + dashboard; no part of the surface needs
 // access to the camera, microphone, geolocation, accelerometer,
 // payment, USB, or the deprecated `interest-cohort` (FLoC) browser
 // feature. The deny-all default removes those attack/fingerprint
 // surfaces if certctl is ever embedded in a malicious page or if a
 // dashboard route is XSS-compromised post-CSP-bypass. Operators
 // running certctl with intentional dependence on any of these (e.g.
 // hardware flows that use the WebUSB API, which `usb` governs) can
 // set `Cfg.PermissionsPolicy: ""` to suppress the header entirely,
 // or override with their own narrowed allowlist.
 func SecurityHeadersDefaults() SecurityHeadersConfig {
 	return SecurityHeadersConfig{
 		HSTS:                  "max-age=31536000; includeSubDomains",
@@ -59,6 +99,7 @@ func SecurityHeadersDefaults() SecurityHeadersConfig {
 		ContentTypeOptions:    "nosniff",
 		ReferrerPolicy:        "no-referrer-when-downgrade",
 		ContentSecurityPolicy: "default-src 'self'; img-src 'self' data:; style-src 'self' 'unsafe-inline'; script-src 'self'; connect-src 'self'; frame-ancestors 'none'",
 		PermissionsPolicy:     "accelerometer=(), camera=(), geolocation=(), microphone=(), payment=(), usb=(), interest-cohort=()",
 	}
 }
@@ -74,7 +115,7 @@ func SecurityHeaders(cfg SecurityHeadersConfig) func(http.Handler) http.Handler
 	// Pre-trim each value once; the per-request hot path stays a
 	// straight set of map writes.
 	type headerEntry struct{ name, value string }
-	entries := make([]headerEntry, 0, 5)
+	entries := make([]headerEntry, 0, 6)
 	add := func(name, value string) {
 		v := strings.TrimSpace(value)
 		if v != "" {
@@ -86,6 +127,7 @@ func SecurityHeaders(cfg SecurityHeadersConfig) func(http.Handler) http.Handler
 	add("X-Content-Type-Options", cfg.ContentTypeOptions)
 	add("Referrer-Policy", cfg.ReferrerPolicy)
 	add("Content-Security-Policy", cfg.ContentSecurityPolicy)
 	add("Permissions-Policy", cfg.PermissionsPolicy)
 	return func(next http.Handler) http.Handler {
 		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
@@ -25,6 +25,7 @@ func TestSecurityHeaders_DefaultsAllPresent(t *testing.T) {
 		"X-Content-Type-Options",
 		"Referrer-Policy",
 		"Content-Security-Policy",
 		"Permissions-Policy",
 	} {
 		if got := rec.Header().Get(h); got == "" {
 			t.Errorf("expected header %q to be set, got empty", h)
@@ -102,3 +103,51 @@ func TestSecurityHeaders_AppliedOnErrorResponses(t *testing.T) {
 		t.Errorf("CSP missing on 401 response")
 	}
 }
 // TestSecurityHeaders_PermissionsPolicyDefault pins the literal value
 // of the default Permissions-Policy header. Acquisition-audit SEC-008
 // closure (Sprint 2 ACQ, 2026-05-16). The deny-all baseline removes
 // camera/microphone/geolocation/accelerometer/payment/USB/interest-cohort
 // attack + fingerprint surfaces — none of which the certctl control
 // plane needs. A regression here (e.g. someone widening to allow
 // camera=*) would surface as a failing test.
 func TestSecurityHeaders_PermissionsPolicyDefault(t *testing.T) {
 	mw := SecurityHeaders(SecurityHeadersDefaults())
 	handler := mw(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
 		w.WriteHeader(http.StatusOK)
 	}))
 	rec := httptest.NewRecorder()
 	handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
 	got := rec.Header().Get("Permissions-Policy")
 	if got == "" {
 		t.Fatal("Permissions-Policy missing from default response")
 	}
 	want := "accelerometer=(), camera=(), geolocation=(), microphone=(), payment=(), usb=(), interest-cohort=()"
 	if got != want {
 		t.Errorf("Permissions-Policy default = %q; want %q", got, want)
 	}
 }
 // TestSecurityHeaders_PermissionsPolicyOverrideToEmptySuppresses pins
 // the operator escape hatch: setting Cfg.PermissionsPolicy = "" makes
 // the middleware omit the header entirely (per the per-field empty-
 // string suppression contract), without affecting the other defaults.
 // Acquisition-audit SEC-008 closure (Sprint 2 ACQ, 2026-05-16).
 func TestSecurityHeaders_PermissionsPolicyOverrideToEmptySuppresses(t *testing.T) {
 	cfg := SecurityHeadersDefaults()
 	cfg.PermissionsPolicy = ""
 	mw := SecurityHeaders(cfg)
 	handler := mw(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
 		w.WriteHeader(http.StatusOK)
 	}))
 	rec := httptest.NewRecorder()
 	handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
 	if got := rec.Header().Get("Permissions-Policy"); got != "" {
 		t.Errorf("Permissions-Policy = %q; want empty (operator override-to-empty suppression)", got)
 	}
 	if got := rec.Header().Get("Strict-Transport-Security"); got == "" {
 		t.Errorf("HSTS suppressed too; the empty-string override is per-field")
 	}
 }
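The per-field empty-string suppression contract pinned by the test above can be sketched in isolation. This is a simplified illustration, not the middleware from this change: `headerSet`, `withHeaders`, and `probe` are hypothetical names, the config is cut down to two fields, and the per-request loop is an assumption based on the "straight set of map writes" comment.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// headerSet is a hypothetical two-field config; the real
// SecurityHeadersConfig carries more fields.
type headerSet struct {
	HSTS              string
	PermissionsPolicy string
}

// withHeaders trims each value once at construction time and keeps only
// the non-empty ones, so an empty (or whitespace-only) field suppresses
// that single header without touching the others.
func withHeaders(cfg headerSet) func(http.Handler) http.Handler {
	type entry struct{ name, value string }
	var entries []entry
	add := func(name, value string) {
		if v := strings.TrimSpace(value); v != "" {
			entries = append(entries, entry{name, v})
		}
	}
	add("Strict-Transport-Security", cfg.HSTS)
	add("Permissions-Policy", cfg.PermissionsPolicy)

	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			for _, e := range entries {
				w.Header().Set(e.name, e.value)
			}
			next.ServeHTTP(w, r)
		})
	}
}

// probe sends one GET through the middleware and returns the response headers.
func probe(cfg headerSet) http.Header {
	h := withHeaders(cfg)(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Header()
}

func main() {
	hdr := probe(headerSet{HSTS: "max-age=31536000; includeSubDomains", PermissionsPolicy: ""})
	_, hasPP := hdr["Permissions-Policy"]
	fmt.Printf("HSTS=%q Permissions-Policy present=%v\n", hdr.Get("Strict-Transport-Security"), hasPP)
}
```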
@@ -11,6 +11,43 @@ import (
 	"github.com/certctl-io/certctl/internal/auth"
 )
 // etaggedFunc wraps a list-endpoint handler with the SCALE-L2 ETag
 // middleware. Phase 6 SCALE-L2 closure (2026-05-14): the top-5
 // read-heavy list endpoints (/certificates, /jobs, /agents,
 // /audit, /discovered-certificates) get ETag + If-None-Match
 // short-circuit to avoid re-running their SELECT COUNT(*) +
 // row-marshaling pass on every dashboard poll.
 //
 // Call-site shape (rbacGate is OUTER, etaggedFunc is INNER):
 //
 //	r.Register(route, rbacGate(checker, "perm", etaggedFunc(handler)))
 //
 // Wrap order at request time:
 //
 //	request → rbacGate → etaggedFunc → handler
 //
 // Auth runs FIRST. Requests that fail the gate bounce at HTTP 403
 // before the response-buffering ETag middleware ever runs, so the
 // SHA-256-over-body cost only applies to authenticated 2xx
 // responses. This shape is also what TestRouterRBACGateCoverage
 // asserts (the AST CI guard introduced for 2026-05-10 audit CRIT-1
 // requires rbacGate / rbacGateScoped to be the OUTER wrap on every
 // state-changing or read endpoint).
 //
 // Phase 6's initial commit shipped the OPPOSITE order
 // (etagged(rbacGate(handler))) — functionally safe because the ETag
 // middleware emits ETag only on 2xx responses, but it failed the
 // AST coverage test. The Phase 8 hotfix (for the commit, see
 // git log --grep=U1000 follow-on) inverted the wrap so rbacGate is
 // the outer call.
 //
 // The signature is http.HandlerFunc → http.HandlerFunc (not the
 // http.Handler form) because rbacGate expects http.HandlerFunc as
 // its third arg; nesting an http.Handler-returning helper inside it
 // would type-fail.
 func etaggedFunc(h http.HandlerFunc) http.HandlerFunc {
 	return middleware.ETag(h).ServeHTTP
 }
 // rbacGate wraps a handler with auth.RequirePermission(checker, perm,
 // nil) — i.e. a GLOBAL-SCOPE permission check. Used by RegisterHandlers
 // to gate every state-changing + read endpoint. When checker is nil the
@@ -567,7 +604,7 @@ func (r *Router) RegisterHandlers(reg HandlerRegistry) {
 	r.Register("POST /api/v1/est/certificates/bulk-revoke", rbacGate(reg.Checker, "cert.bulk_revoke", reg.BulkRevocation.BulkRevokeEST))
 	r.Register("POST /api/v1/certificates/bulk-renew", rbacGate(reg.Checker, "cert.issue", reg.BulkRenewal.BulkRenew))
 	r.Register("POST /api/v1/certificates/bulk-reassign", rbacGate(reg.Checker, "cert.edit", reg.BulkReassignment.BulkReassign))
-	r.Register("GET /api/v1/certificates", rbacGate(reg.Checker, "cert.read", reg.Certificates.ListCertificates))
+	r.Register("GET /api/v1/certificates", rbacGate(reg.Checker, "cert.read", etaggedFunc(reg.Certificates.ListCertificates)))
 	r.Register("POST /api/v1/certificates", rbacGate(reg.Checker, "cert.issue", reg.Certificates.CreateCertificate))
 	r.Register("GET /api/v1/certificates/{id}", rbacGate(reg.Checker, "cert.read", reg.Certificates.GetCertificate))
 	r.Register("PUT /api/v1/certificates/{id}", rbacGate(reg.Checker, "cert.edit", reg.Certificates.UpdateCertificate))
@@ -619,7 +656,7 @@ func (r *Router) RegisterHandlers(reg HandlerRegistry) {
 	//   * DELETE /api/v1/agents/{id} — RetireAgent. Replaces the pre-I-004
 	//     hard-delete; the underlying repo does a soft-retire with
 	//     optional cascade.
-	r.Register("GET /api/v1/agents", rbacGate(reg.Checker, "agent.read", reg.Agents.ListAgents))
+	r.Register("GET /api/v1/agents", rbacGate(reg.Checker, "agent.read", etaggedFunc(reg.Agents.ListAgents)))
 	r.Register("POST /api/v1/agents", rbacGate(reg.Checker, "agent.edit", reg.Agents.RegisterAgent))
 	r.Register("GET /api/v1/agents/retired", rbacGate(reg.Checker, "agent.read", reg.Agents.ListRetiredAgents))
 	r.Register("GET /api/v1/agents/{id}", rbacGate(reg.Checker, "agent.read", reg.Agents.GetAgent))
@@ -631,7 +668,7 @@ func (r *Router) RegisterHandlers(reg HandlerRegistry) {
 	r.Register("POST /api/v1/agents/{id}/jobs/{job_id}/status", rbacGate(reg.Checker, "agent.job.complete", reg.Agents.AgentReportJobStatus))
 	// Jobs routes: /api/v1/jobs
-	r.Register("GET /api/v1/jobs", rbacGate(reg.Checker, "job.read", reg.Jobs.ListJobs))
+	r.Register("GET /api/v1/jobs", rbacGate(reg.Checker, "job.read", etaggedFunc(reg.Jobs.ListJobs)))
 	r.Register("GET /api/v1/jobs/{id}", rbacGate(reg.Checker, "job.read", reg.Jobs.GetJob))
 	r.Register("POST /api/v1/jobs/{id}/cancel", rbacGate(reg.Checker, "job.cancel", reg.Jobs.CancelJob))
 	r.Register("POST /api/v1/jobs/{id}/approve", rbacGate(reg.Checker, "approval.approve", reg.Jobs.ApproveJob))
@@ -695,7 +732,7 @@ func (r *Router) RegisterHandlers(reg HandlerRegistry) {
 	r.Register("GET /api/v1/agent-groups/{id}/members", rbacGate(reg.Checker, "agent.read", reg.AgentGroups.ListAgentGroupMembers))
 	// Audit routes: /api/v1/audit
-	r.Register("GET /api/v1/audit", rbacGate(reg.Checker, "audit.read", reg.Audit.ListAuditEvents))
+	r.Register("GET /api/v1/audit", rbacGate(reg.Checker, "audit.read", etaggedFunc(reg.Audit.ListAuditEvents)))
 	// Audit 2026-05-10 HIGH-11 closure — `audit.export` permission was
 	// already seeded into r-admin + r-auditor (migration 000031), but
 	// no endpoint enforced it pre-fix; r-auditor's claim was misleading
@@ -765,7 +802,7 @@ func (r *Router) RegisterHandlers(reg HandlerRegistry) {
 	// Discovery routes: /api/v1/discovered-certificates, /api/v1/discovery-scans
 	r.Register("POST /api/v1/agents/{id}/discoveries", rbacGate(reg.Checker, "discovery.run", reg.Discovery.SubmitDiscoveryReport))
-	r.Register("GET /api/v1/discovered-certificates", rbacGate(reg.Checker, "discovery.read", reg.Discovery.ListDiscovered))
+	r.Register("GET /api/v1/discovered-certificates", rbacGate(reg.Checker, "discovery.read", etaggedFunc(reg.Discovery.ListDiscovered)))
 	r.Register("GET /api/v1/discovered-certificates/{id}", rbacGate(reg.Checker, "discovery.read", reg.Discovery.GetDiscovered))
 	r.Register("POST /api/v1/discovered-certificates/{id}/claim", rbacGate(reg.Checker, "discovery.claim", reg.Discovery.ClaimDiscovered))
 	r.Register("POST /api/v1/discovered-certificates/{id}/dismiss", rbacGate(reg.Checker, "discovery.claim", reg.Discovery.DismissDiscovered))
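The gate-outer, ETag-inner wrap order used by the five list routes above can be sketched with stand-ins. Nothing below is the project's code: `gate` and `etag` are hypothetical substitutes for rbacGate and middleware.ETag, and the If-None-Match handling is deliberately reduced to an exact string match (no tag lists, no weak validators).

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// bufWriter captures the inner handler's status and body so the ETag can
// be computed before anything reaches the client.
type bufWriter struct {
	hdr    http.Header
	status int
	body   bytes.Buffer
}

func (b *bufWriter) Header() http.Header         { return b.hdr }
func (b *bufWriter) WriteHeader(code int)        { b.status = code }
func (b *bufWriter) Write(p []byte) (int, error) { return b.body.Write(p) }

// etag is a hypothetical stand-in for middleware.ETag: it hashes 2xx
// bodies and answers 304 when the client already holds the same tag.
func etag(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		bw := &bufWriter{hdr: w.Header(), status: http.StatusOK}
		next(bw, r)
		if bw.status >= 200 && bw.status < 300 {
			sum := sha256.Sum256(bw.body.Bytes())
			tag := `"` + hex.EncodeToString(sum[:]) + `"`
			w.Header().Set("ETag", tag)
			if r.Header.Get("If-None-Match") == tag {
				w.WriteHeader(http.StatusNotModified)
				return
			}
		}
		w.WriteHeader(bw.status)
		w.Write(bw.body.Bytes())
	}
}

// gate is a hypothetical stand-in for rbacGate: a denied request returns
// 403 and never reaches the wrapped handler.
func gate(allowed bool, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if !allowed {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next(w, r)
	}
}

// list plays the role of a list-endpoint handler.
func list(w http.ResponseWriter, _ *http.Request) { fmt.Fprint(w, `{"items":[]}`) }

// do issues one GET and returns the status code and ETag header.
func do(h http.HandlerFunc, ifNoneMatch string) (int, string) {
	req := httptest.NewRequest(http.MethodGet, "/api/v1/jobs", nil)
	if ifNoneMatch != "" {
		req.Header.Set("If-None-Match", ifNoneMatch)
	}
	rec := httptest.NewRecorder()
	h(rec, req)
	return rec.Code, rec.Header().Get("ETag")
}

func main() {
	h := gate(true, etag(list)) // gate is the OUTER call, etag the INNER one
	code, tag := do(h, "")
	again, _ := do(h, tag)
	denied, _ := do(gate(false, etag(list)), "")
	fmt.Println(code, again, denied) // prints "200 304 403"
}
```

Because the gate rejects before the buffering wrapper runs, the hashing cost in this sketch is paid only on requests that passed the permission check.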
@@ -7,6 +7,8 @@ import (
 	"net/http/httptest"
 	"strings"
 	"testing"
 	"github.com/certctl-io/certctl/internal/validation"
 )
 // Coverage fill — v2.1.0 release gate Phase 3.
@@ -59,6 +61,54 @@ func TestJWKSStatus_ReturnsSnapshot_AfterAuthRequestPopulatesEntry(t *testing.T)
 	}
 }
 // TestTestDiscovery_RejectsSSRFIssuer_AtEarlyFailRail pins the
 // SEC-001 closure (Sprint 1, 2026-05-16): TestDiscovery refuses
 // reserved-address issuers up-front via validateIssuerSSRF, surfacing
 // a clean "issuer_url failed SSRF policy" error in the result's
 // Errors slice without ever hitting the dial path. The package-wide
 // setup_test.go init() swaps validateIssuerSSRF to a no-op so the
 // other tests can use httptest loopback servers; this test temporarily
 // restores the production gate (validation.ValidateSafeURL) and
 // asserts the rejection fires.
 func TestTestDiscovery_RejectsSSRFIssuer_AtEarlyFailRail(t *testing.T) {
 	prev := validateIssuerSSRF
 	validateIssuerSSRF = validation.ValidateSafeURL
 	defer func() { validateIssuerSSRF = prev }()
 	svc := newServiceForUnitTest(t)
 	cases := []struct {
 		name   string
 		issuer string
 	}{
 		{"loopback_v4", "https://127.0.0.1/realms/certctl"},
 		{"loopback_v6", "https://[::1]/realms/certctl"},
 		{"cloud_metadata", "https://169.254.169.254/latest/meta-data/"},
 		{"link_local_v4", "https://169.254.10.5/realms/certctl"},
 		{"link_local_v6", "https://[fe80::1]/realms/certctl"},
 	}
 	for _, tc := range cases {
 		t.Run(tc.name, func(t *testing.T) {
 			res, err := svc.TestDiscovery(context.Background(), tc.issuer)
 			if err != nil {
 				t.Fatalf("TestDiscovery returned err (want nil; failures belong in res.Errors): %v", err)
 			}
 			if res == nil {
 				t.Fatalf("expected non-nil result")
 			}
 			if res.DiscoverySucceeded {
 				t.Errorf("expected DiscoverySucceeded=false for SSRF issuer; got true")
 			}
 			if len(res.Errors) == 0 {
 				t.Fatalf("expected non-empty Errors slice")
 			}
 			joined := strings.Join(res.Errors, "|")
 			if !strings.Contains(joined, "SSRF policy") {
 				t.Errorf("expected 'SSRF policy' in errors; got %v", res.Errors)
 			}
 		})
 	}
 }
 // TestTestDiscovery_DiscoveryFailure_ReturnsErrorsSlice points
 // TestDiscovery at a URL that doesn't serve a discovery doc; the
 // function MUST return res with DiscoverySucceeded=false and a
@@ -22,6 +22,7 @@ import (
 	"time"
 	authdomain "github.com/certctl-io/certctl/internal/domain/auth"
 	"github.com/certctl-io/certctl/internal/validation"
 )
 // OIDCProvider describes a configured OpenID Connect identity provider
@@ -160,6 +161,16 @@ func (p *OIDCProvider) Validate() error {
 	if _, err := url.Parse(p.IssuerURL); err != nil {
 		return fmt.Errorf("oidc: issuer_url is not a valid URL: %w", err)
 	}
 	// SEC-001 closure (Sprint 1, 2026-05-16): reject reserved-address
 	// issuers (loopback / RFC 1918 / link-local / cloud metadata) at
 	// provider-creation time. Defense-in-depth alongside
 	// oidc.SafeOIDCContext, which is the authoritative dial-time
 	// re-resolution + reject. The static URL check stops the obvious
 	// case ("https://169.254.169.254/...") before the row is persisted
 	// or the dry-run validator runs.
 	if err := validation.ValidateSafeURL(p.IssuerURL); err != nil {
 		return fmt.Errorf("oidc: issuer_url failed SSRF policy: %w", err)
 	}
 	if strings.TrimSpace(p.ClientID) == "" {
 		return ErrOIDCEmptyClientID
 	}
@@ -82,6 +82,41 @@ func TestOIDCProvider_Validate_RejectsNonHTTPSIssuer(t *testing.T) {
 	}
 }
 // SEC-001 closure (Sprint 1, 2026-05-16). The IssuerURL Validate gate
 // now refuses reserved-address issuers (loopback, RFC 1918,
 // link-local, IPv6 loopback, IPv6 link-local, cloud metadata) so a
 // row claiming https://127.0.0.1/... or https://169.254.169.254/...
 // never makes it to the persistence layer or the runtime discovery
 // dial. Authoritative dial-time rejection lives in
 // internal/validation.SafeHTTPDialContext (DNS-rebinding-safe); this
 // test pins the static URL gate that surfaces the policy violation
 // with a clean error before any network I/O.
 func TestOIDCProvider_Validate_RejectsSSRFIssuer(t *testing.T) {
 	cases := []struct {
 		name   string
 		issuer string
 	}{
 		{"loopback_v4", "https://127.0.0.1/realms/certctl"},
 		{"loopback_v6", "https://[::1]/realms/certctl"},
 		{"cloud_metadata", "https://169.254.169.254/latest/meta-data/"},
 		{"link_local_v4", "https://169.254.10.5/realms/certctl"},
 		{"link_local_v6", "https://[fe80::1]/realms/certctl"},
 	}
 	for _, tc := range cases {
 		t.Run(tc.name, func(t *testing.T) {
 			p := validProvider()
 			p.IssuerURL = tc.issuer
 			err := p.Validate()
 			if err == nil {
 				t.Fatalf("issuer=%q: Validate returned nil; want SSRF policy rejection", tc.issuer)
 			}
 			if !strings.Contains(err.Error(), "SSRF policy") {
 				t.Errorf("issuer=%q: err=%v; want error mentioning SSRF policy", tc.issuer, err)
 			}
 		})
 	}
 }
 func TestOIDCProvider_Validate_RejectsEmptyClientID(t *testing.T) {
 	p := validProvider()
 	p.ClientID = ""
@@ -0,0 +1,122 @@
 // Copyright 2026 certctl LLC. All rights reserved.
 // SPDX-License-Identifier: BUSL-1.1
 package oidc
 // SEC-001 closure (Sprint 1, 2026-05-16). Pre-fix, two OIDC discovery
 // call sites passed the bare request context to gooidc.NewProvider:
 //
 //   - test_discovery.go:65  (dry-run validator from the GUI)
 //   - service.go:1066       (runtime provider load on first cache miss)
 //
 // Acquisition-audit follow-up SEC-020 + SEC-021 (Sprint 1 follow-up,
 // 2026-05-16) extended the same wrap to two adjacent call sites that
 // the original SEC-001 sweep missed:
 //
 //   - service.go::fetchUserinfoGroups (~L948-961, SEC-020 closure) —
 //     the userinfo-fallback path called entry.provider.UserInfo(ctx, ts)
 //     with bare ctx. go-oidc/v3 Provider.UserInfo derives its HTTP
 //     client from the context via getClient(ctx) (oidc.go:61-65);
 //     without an override, the internal doRequest falls through to
 //     http.DefaultClient.
 //   - internal/api/handler/auth_session_oidc_bcl.go::Verify (~L125,
 //     SEC-021 closure) — the back-channel-logout verifier performs a
 //     per-request discovery re-fetch via gooidc.NewProvider(ctx, ...)
 //     with bare ctx; SafeOIDCContext now wraps before the call.
 //
 // Context-key shape: gooidc.ClientContext is implemented as
 //   context.WithValue(ctx, oauth2.HTTPClient, client)
 // (go-oidc v3.18.0 oidc.go:57-59). Both go-oidc's getClient AND
 // golang.org/x/oauth2's internal.ContextClient read oauth2.HTTPClient,
 // so the SINGLE SafeOIDCContext wrap covers go-oidc-driven HTTP calls
 // (Provider.UserInfo / NewProvider discovery / Verifier JWKS) AND
 // oauth2-driven HTTP calls (Config.TokenSource refresh / Exchange).
 // No additional context.WithValue(ctx, oauth2.HTTPClient, ...) is
 // required alongside the wrap.
 //
 // gooidc.NewProvider derives its HTTP client from the context via
 // oidc.ClientContext; with no override it falls through to
 // http.DefaultClient. The default client has no SSRF guard, so an admin
 // with `auth.oidc.create` could induce server-side HTTPS egress to
 // loopback (127.0.0.1, ::1), RFC 1918 (10/8 / 172.16/12 / 192.168/16),
 // link-local (169.254/16, which includes the 169.254.169.254
 // cloud-instance metadata endpoint), and IPv6 link-local (fe80::/10).
 //
 // The companion JWKS reachability probe (jwksReachable + jwksProbeClient
 // in this package) was already routed through SafeHTTPDialContext via
 // the Bundle 5 R6 closure; the discovery + claims path bypassed that
 // guard.
 //
 // This file adds the symmetric guard for the discovery leg:
 //
 //   - oidcDiscoveryClient — an *http.Client wrapping a Transport whose
 //     DialContext is SafeHTTPDialContext, sized to the same outbound
 //     budget as jwksProbeClient (oidcOutboundTimeout = 10s).
 //   - SafeOIDCContext(ctx) — returns a context that gooidc.NewProvider
 //     and the resulting Verifier will use for every outbound call.
 //
 // The two SEC-001 call sites listed first (test_discovery.go and the
 // service.go runtime load) are rewritten to thread their context through
 // SafeOIDCContext before NewProvider runs. The fail-closed posture is
 // owned by validation.SafeHTTPDialContext — DNS-rebinding-safe by
 // re-resolving at dial time and rejecting any reserved address that
 // surfaces in the resolution.
 //
 // Defense-in-depth: OIDCProvider.Validate (domain/types.go) also calls
 // validation.ValidateSafeURL on the persisted IssuerURL at provider-
 // creation time so reserved-address issuers fail before they ever reach
 // the cache + dial path.
 import (
 	"context"
 	"net/http"
 	"time"
 	gooidc "github.com/coreos/go-oidc/v3/oidc"
 	"github.com/certctl-io/certctl/internal/validation"
 )
 // oidcDiscoveryClient is the *http.Client gooidc.NewProvider uses for
 // the discovery doc fetch + the per-Verifier JWKS read it issues
 // internally on first sig-verify. Routed through SafeHTTPDialContext
 // so the dial-time guard re-resolves the issuer host and rejects
 // loopback / link-local / private / cloud-metadata before any HTTP
 // byte goes out. Mirrors jwksProbeClient (test_discovery.go) so both
 // outbound paths share an identical SSRF posture.
 //
 // Package-level var so the test suite can swap it for an
 // SSRF-guard-bypassed client when exercising the discovery code path
 // against httptest.NewServer (which binds to 127.0.0.1 and would
 // otherwise be refused). Mirrors the webhook/slack/teams test-seam
 // pattern. Production code never reassigns this var.
 var oidcDiscoveryClient = &http.Client{
 	Timeout: oidcOutboundTimeout,
 	Transport: &http.Transport{
 		DialContext:           validation.SafeHTTPDialContext(oidcOutboundTimeout),
 		MaxIdleConns:          10,
 		IdleConnTimeout:       90 * time.Second,
 		TLSHandshakeTimeout:   10 * time.Second,
 		ExpectContinueTimeout: 1 * time.Second,
 	},
 }
 // SafeOIDCContext returns a derived context that carries the SSRF-safe
 // discovery http.Client. Pass the result to gooidc.NewProvider so that
 // the discovery doc fetch + the internal JWKS fetch the resulting
 // Verifier issues both run through SafeHTTPDialContext.
 //
 // Callers SHOULD use this wrapper for every gooidc.NewProvider call
 // site; the package's own callers (service.go runtime load,
 // test_discovery.go dry-run validator) do this unconditionally.
 func SafeOIDCContext(ctx context.Context) context.Context {
 	return gooidc.ClientContext(ctx, oidcDiscoveryClient)
 }
 // validateIssuerSSRF is the package-level seam tests substitute for the
 // static issuer-URL SSRF gate. Production callers always run through
 // validation.ValidateSafeURL; tests using httptest.NewServer (which
 // binds to 127.0.0.1) swap this to a no-op in setup_test.go so the
 // loopback URL doesn't trip the early-fail rail. Mirrors the
 // jwksProbeClient / oidcDiscoveryClient test-seam pattern. Production
 // code MUST NOT reassign this var.
 var validateIssuerSSRF = validation.ValidateSafeURL
@@ -948,8 +948,19 @@ func (s *Service) fetchUserinfoGroups(
 	if entry.provider.UserInfoEndpoint() == "" {
 		return nil, fmt.Errorf("oidc: userinfo fallback configured but provider has no userinfo endpoint")
 	}
-	ts := entry.oauthConfig.TokenSource(ctx, token)
+	// Acquisition-audit SEC-020 closure (Sprint 1 follow-up to SEC-001,
-	uinfo, err := entry.provider.UserInfo(ctx, ts)
+	// 2026-05-16). Wrap ctx via SafeOIDCContext before TokenSource +
 	// UserInfo so the SSRF guard owned by validation.SafeHTTPDialContext
 	// re-resolves the userinfo endpoint at dial time and refuses reserved
 	// addresses (loopback / link-local / cloud-metadata). The single wrap
 	// covers both legs because gooidc.ClientContext and oauth2.TokenSource
 	// both read the same oauth2.HTTPClient context key (see go-oidc/v3
 	// oidc.go:57-65 and golang.org/x/oauth2 oauth2.go:339-341). Production
 	// provider-load paths in this package already use SafeOIDCContext; the
 	// userinfo fallback was missed in the SEC-001 sweep.
 	safeCtx := SafeOIDCContext(ctx)
 	ts := entry.oauthConfig.TokenSource(safeCtx, token)
 	uinfo, err := entry.provider.UserInfo(safeCtx, ts)
 	if err != nil {
 		return nil, fmt.Errorf("oidc: userinfo fetch: %w", err)
 	}
@@ -1063,7 +1074,14 @@ func (s *Service) getOrLoad(ctx context.Context, providerID string) (*providerEn
 	}
 	// Fetch + cache the discovery doc + JWKS via go-oidc.
-	provider, err := gooidc.NewProvider(ctx, cfgRow.IssuerURL)
+	//
 	// SEC-001 closure (Sprint 1, 2026-05-16): the bare `ctx` is wrapped
 	// in SafeOIDCContext so the discovery fetch + every subsequent
 	// Verifier-issued JWKS fetch run through validation.SafeHTTPDialContext.
 	// Pre-fix this path used http.DefaultClient and could be aimed at
 	// loopback / RFC 1918 / link-local / cloud-metadata addresses via the
 	// admin-supplied issuer URL. See safehttp.go for the full closure note.
 	provider, err := gooidc.NewProvider(SafeOIDCContext(ctx), cfgRow.IssuerURL)
 	if err != nil {
 		return nil, fmt.Errorf("oidc: discovery fetch failed for %s: %w", providerID, err)
 	}
@@ -19,11 +19,15 @@ import (
 	"github.com/go-jose/go-jose/v4"
 	"github.com/go-jose/go-jose/v4/jwt"
 	"golang.org/x/oauth2"
 	gooidc "github.com/coreos/go-oidc/v3/oidc"
 	oidcdomain "github.com/certctl-io/certctl/internal/auth/oidc/domain"
 	userdomain "github.com/certctl-io/certctl/internal/auth/user/domain"
 	cryptopkg "github.com/certctl-io/certctl/internal/crypto"
 	"github.com/certctl-io/certctl/internal/repository"
 	"github.com/certctl-io/certctl/internal/validation"
 )
 // sha384New returns a SHA-384 hash via crypto/sha512 (Go stdlib).
@@ -392,6 +396,20 @@ func (s *stubUsers) ListAll(_ context.Context, _ string) ([]*userdomain.User, er
 	return out, nil
 }
+// ListDeactivatedBefore satisfies the Sprint 6 COMP-002-RETENTION
+// interface addition. Stub-side: walk byID and filter on the
+// DeactivatedAt cursor; OIDC service tests don't care about ordering
+// stability.
+func (s *stubUsers) ListDeactivatedBefore(_ context.Context, threshold time.Time) ([]*userdomain.User, error) {
+	var out []*userdomain.User
+	for _, u := range s.byID {
+		if u.DeactivatedAt != nil && u.DeactivatedAt.Before(threshold) {
+			out = append(out, u)
+		}
+	}
+	return out, nil
+}
 type stubSessions struct {
 	cookieValue string
 	csrfToken   string
@@ -2386,3 +2404,106 @@ func TestService_UpsertUser_ValidateErrorOnEmptyEmail(t *testing.T) {
 		t.Errorf("err = %v; want validate wrap", err)
 	}
 }
+// Acquisition-audit SEC-020 closure (Sprint 1 follow-up to SEC-001,
+// 2026-05-16). fetchUserinfoGroups previously called
+// entry.provider.UserInfo(ctx, ts) with the bare request context. go-oidc
+// /v3's Provider.UserInfo derives its http.Client from ctx via
+// getClient(ctx) (oidc.go:61-65); without an override the internal
+// doRequest falls through to http.DefaultClient — an unwrapped client
+// with no SSRF guard. The fix wraps ctx via SafeOIDCContext so the
+// dial-time SafeHTTPDialContext guard re-resolves the userinfo
+// endpoint and rejects reserved-address answers.
+//
+// This test exercises the wrap end-to-end:
+//
+//  1. Stand up a discovery httptest server (loopback) whose discovery
+//     doc advertises userinfo_endpoint = "http://169.254.169.254/userinfo"
+//     (link-local cloud-metadata range — rejected by
+//     validation.SafeHTTPDialContext.isReservedIPForDial).
+//  2. Construct the *gooidc.Provider via the test-bypassed
+//     oidcDiscoveryClient (setup_test.go's init() leaves it bypassed for
+//     the package).
+//  3. Restore the production-shape oidcDiscoveryClient (the one whose
+//     Transport.DialContext is validation.SafeHTTPDialContext) BEFORE
+//     calling fetchUserinfoGroups, so the SafeOIDCContext wrap inside
+//     the function captures the production guard at ctx-wrap time.
+//  4. Call fetchUserinfoGroups and assert the resulting error wraps the
+//     dial-time reserved-address rejection (substring "refusing to
+//     dial" / "reserved address"), not a generic transport error.
+//
+// The test does NOT use t.Parallel() — it mutates the package-level
+// oidcDiscoveryClient and must run serially against any other test that
+// reads the same var.
+func TestFetchUserinfoGroups_SSRF_BlocksReservedAddress(t *testing.T) {
+	// Stand up a loopback discovery server. Discovery doc's
+	// userinfo_endpoint points at the link-local cloud-metadata IP so
+	// the subsequent UserInfo dial trips SafeHTTPDialContext.
+	var discoveryURL string
+	mux := http.NewServeMux()
+	mux.HandleFunc("/.well-known/openid-configuration", func(w http.ResponseWriter, r *http.Request) {
+		doc := map[string]interface{}{
+			"issuer":                                discoveryURL,
+			"authorization_endpoint":                discoveryURL + "/authorize",
+			"token_endpoint":                        discoveryURL + "/token",
+			"jwks_uri":                              discoveryURL + "/jwks",
+			"userinfo_endpoint":                     "http://169.254.169.254/userinfo",
+			"id_token_signing_alg_values_supported": []string{"RS256"},
+			"response_types_supported":              []string{"code"},
+			"subject_types_supported":               []string{"public"},
+		}
+		w.Header().Set("Content-Type", "application/json")
+		_ = json.NewEncoder(w).Encode(doc)
+	})
+	srv := httptest.NewServer(mux)
+	defer srv.Close()
+	discoveryURL = srv.URL
+	// Build the *gooidc.Provider using the test-bypassed discovery
+	// client (setup_test.go init() already swapped oidcDiscoveryClient
+	// to a DefaultTransport-backed client so the httptest loopback URL
+	// resolves cleanly).
+	ctx := context.Background()
+	provider, err := gooidc.NewProvider(SafeOIDCContext(ctx), discoveryURL)
+	if err != nil {
+		t.Fatalf("NewProvider against loopback discovery server: %v", err)
+	}
+	if got := provider.UserInfoEndpoint(); got != "http://169.254.169.254/userinfo" {
+		t.Fatalf("provider.UserInfoEndpoint() = %q; want link-local override", got)
+	}
+	// Restore the production-shape SafeHTTPDialContext-backed client
+	// just before the call. SafeOIDCContext inside fetchUserinfoGroups
+	// will pick THIS client up because gooidc.ClientContext reads the
+	// package-level var at wrap time.
+	saved := oidcDiscoveryClient
+	t.Cleanup(func() { oidcDiscoveryClient = saved })
+	oidcDiscoveryClient = &http.Client{
+		Timeout: oidcOutboundTimeout,
+		Transport: &http.Transport{
+			DialContext: validation.SafeHTTPDialContext(oidcOutboundTimeout),
+		},
+	}
+	entry := &providerEntry{
+		provider: provider,
+		oauthConfig: &oauth2.Config{
+			ClientID:     "test-client",
+			ClientSecret: "test-secret",
+			Endpoint:     oauth2.Endpoint{TokenURL: discoveryURL + "/token"},
+		},
+	}
+	svc := &Service{}
+	_, err = svc.fetchUserinfoGroups(ctx, entry, &oauth2.Token{AccessToken: "test-access-token"}, "groups")
+	if err == nil {
+		t.Fatal("fetchUserinfoGroups against link-local userinfo endpoint: expected SSRF reject; got nil")
+	}
+	msg := err.Error()
+	// SafeHTTPDialContext emits one of two messages for the literal-IP
+	// case: "refusing to dial reserved address <ip>". Either is the
+	// load-bearing signal we want — a generic connect-refused / EOF
+	// would mean the guard didn't fire.
+	if !strings.Contains(msg, "refusing to dial") && !strings.Contains(msg, "reserved address") {
+		t.Errorf("fetchUserinfoGroups err = %q; want SafeHTTPDialContext reserved-address rejection", msg)
+	}
+}
@@ -29,4 +29,14 @@ func init() {
 		Timeout:   10 * time.Second,
 		Transport: http.DefaultTransport,
 	}
+	// SEC-001 closure companion: same SSRF-bypass for the discovery
+	// fetch's http.Client + the static issuer-URL gate. Tests using
+	// httptest.NewServer get a loopback URL; the production
+	// SafeHTTPDialContext + validateIssuerSSRF would reject these.
+	// Production code never reassigns either var.
+	oidcDiscoveryClient = &http.Client{
+		Timeout:   10 * time.Second,
+		Transport: http.DefaultTransport,
+	}
+	validateIssuerSSRF = func(string) error { return nil }
 }
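The package-level seam used here (a function variable that production code calls and tests reassign) can be illustrated in isolation. The sketch below is an assumption-laden stand-in, not the certctl code: `validateIssuer`, `testDiscovery`, and `withBypassedValidator` are hypothetical names, and the validator checks only literal loopback IPs, whereas the real `validation.ValidateSafeURL` presumably covers more.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
)

// validateIssuer is the seam: production code calls it through the variable,
// so a test can swap in a no-op. Hypothetical, simplified policy: reject
// issuer URLs whose host is a literal loopback IP.
var validateIssuer = func(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if ip := net.ParseIP(u.Hostname()); ip != nil && ip.IsLoopback() {
		return errors.New("loopback issuer not allowed")
	}
	return nil
}

// testDiscovery mirrors the early-fail shape: a policy failure is reported
// in the result slice rather than returned as an error.
func testDiscovery(issuerURL string) []string {
	if err := validateIssuer(issuerURL); err != nil {
		return []string{fmt.Sprintf("issuer_url failed SSRF policy: %v", err)}
	}
	return nil
}

// withBypassedValidator swaps the seam for a no-op while fn runs and then
// restores it, the same save/restore discipline a test would use.
func withBypassedValidator(fn func()) {
	saved := validateIssuer
	defer func() { validateIssuer = saved }()
	validateIssuer = func(string) error { return nil }
	fn()
}

func main() {
	fmt.Println(len(testDiscovery("http://127.0.0.1:8080"))) // policy active
	withBypassedValidator(func() {
		fmt.Println(len(testDiscovery("http://127.0.0.1:8080"))) // bypassed
	})
}
```

Because the seam is mutable package state, any test that reassigns it should restore the saved value and avoid running in parallel with tests that read it.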
@@ -58,11 +58,31 @@ type TestDiscoveryResult struct {
 func (s *Service) TestDiscovery(ctx context.Context, issuerURL string) (*TestDiscoveryResult, error) {
 	res := &TestDiscoveryResult{}
+	// SEC-001 closure (Sprint 1, 2026-05-16): refuse reserved-address
+	// issuers up-front so operators see a clear policy error instead
+	// of the lower-level dial-rejection wrap from SafeHTTPDialContext.
+	// The dial-time guard remains the authoritative DNS-rebinding-safe
+	// defense; this is the early-fail UX rail. Routed through the
+	// validateIssuerSSRF package-level seam so tests using
+	// httptest.NewServer can swap it for a no-op (see setup_test.go).
+	if vErr := validateIssuerSSRF(issuerURL); vErr != nil {
+		res.Errors = append(res.Errors, fmt.Sprintf("issuer_url failed SSRF policy: %v", vErr))
+		return res, nil
+	}
 	// Step 1 — discovery. gooidc.NewProvider fetches
 	// `<issuer>/.well-known/openid-configuration` and runs the iss
 	// match check internally; on failure it returns a fmt-style
 	// wrapped error.
-	provider, err := gooidc.NewProvider(ctx, issuerURL)
+	//
+	// SEC-001 closure (Sprint 1, 2026-05-16): the bare `ctx` is wrapped
+	// in SafeOIDCContext so the discovery fetch + the resulting
+	// Verifier's internal JWKS fetch both run through a transport
+	// whose DialContext is validation.SafeHTTPDialContext. Pre-fix the
+	// default HTTP client could be aimed at loopback / RFC 1918 /
+	// link-local / cloud-metadata addresses via the admin-supplied
+	// issuer URL. See safehttp.go for the full closure note.
+	provider, err := gooidc.NewProvider(SafeOIDCContext(ctx), issuerURL)
 	if err != nil {
 		res.Errors = append(res.Errors, fmt.Sprintf("discovery fetch failed: %v", err))
 		return res, nil // Non-fatal at this layer; the response carries the per-leg failure.