Commit Graph

291 Commits

Author SHA1 Message Date
shankar0123 6acf3559a3 docs(scale): TEST-005 — split scale baseline into its own canonical record
Sprint 5 unified-master-audit closure. Pre-fix:

  - docs/operator/scale.md L163-185 held a TBD-laden table with 5
    scenario rows. The Phase 8 scenarios shipped 2026-05-14; baseline
    capture on canonical hardware was 'the next operational step'
    that had not been taken.
  - Acquirers + operators asking 'what's the scale ceiling?' got
    'TBD' as the in-tree answer.

The audit's fix wanted three things:
  1. Capture p50/p95/p99 + error rate + memory profile on a fixed-
     spec runner.
  2. Replace the scale.md TBD rows with real numbers.
  3. Archive k6 artifacts under deploy/test/loadtest-artifacts/.

The actual capture is a workflow_dispatch run the operator triggers
on a real Linux runner — it can't happen from a sandbox without
Docker. What I CAN deliver in this commit is the canonical-record
infrastructure that turns the next workflow run into a baseline that
sticks:

  - New docs/operator/scale-baseline-2026-Q2.md is the canonical
    record. Documents the three scenarios, the methodology, the
    capture procedure, and a 'Latest capture' table with
    placeholder rows ready to receive the workflow_dispatch run's
    numbers. The doc explicitly defends the 'ubuntu-latest runner'
    choice (reproducibility > paid-AWS-account specificity).
  - docs/operator/scale.md L163-185 — the TBD table — replaced with
    a pointer paragraph to the new baseline file. Per the
    canonical-doc-pointer pattern: the operator-posture doc changes
    when scenarios change; the baseline doc changes on every
    capture. Splitting them avoids review-noise on per-capture
    commits.
  - New deploy/test/loadtest-artifacts/ directory with a README
    documenting the long-term-archive contract (the GHA artifact
    retention is 90 days; numbers acquisition reviewers look at
    months later need a committed home).

Operator next steps to fill the placeholders:
  1. Trigger Actions → loadtest → Run workflow.
  2. Download the three matrix-leg artifacts.
  3. Update the baseline doc's 'Latest capture' rows.
  4. Commit the raw artifacts (or git-lfs for >100 MB archives) to
     deploy/test/loadtest-artifacts/.

Closes TEST-005 (infrastructure side). Numbers land on the next
canonical-runner workflow_dispatch capture.
2026-05-16 05:19:57 +00:00
shankar0123 3e09401502 test(ci): TEST-003 — flip Frontend E2E from informational to merge-gate
Sprint 5 unified-master-audit closure. The Phase 8 E2E workflow at
.github/workflows/e2e.yml shipped with continue-on-error: true and
a header banner that said it would be promoted to required-for-merge
once 1-2 weeks of green runs accumulated. The accumulation happened;
the flip didn't.

Ground-truth via api.github.com/repos/certctl-io/certctl/actions/runs
(2026-05-16): 14 consecutive green runs across 2026-05-14 to
2026-05-15 (heaviest Sprint 1-4 frontend churn in the repo's history,
6 commits touching web/**) confirmed the suite is stable. No flakes,
no flaps, no timeouts.

Fix:
  - .github/workflows/e2e.yml continue-on-error: true → false.
  - Workflow name strips the '(informational)' tag.
  - Header banner rewritten to reflect the new posture + flag the
    one operator action still required (adding the job to the
    branch-protection required-checks list at
    https://github.com/certctl-io/certctl/settings/branches).
  - New docs/operator/runbooks/e2e-snapshot-update.md documents the
    visual-regression snapshot-bump workflow now that a red E2E
    run blocks merge. Includes the standard (one or two affected
    tests) + mass-bump (font upgrade / framework migration) paths,
    plus an explicit anti-patterns section (do NOT regenerate from
    a developer's local machine; do NOT add --update-snapshots to
    the always-run step).

Closes TEST-003.
2026-05-16 05:19:38 +00:00
shankar0123 c95685f8ab docs(arch): ARCH-002-MT — document single-tenant model + tenant_id scaffolding
Sprint 4 unified-master-audit closure. Every table that joins on a
tenant identifier (managed_certificates, agents, users, roles, audit
log, etc.) has a tenant_id column. The auth middleware at
internal/auth/middleware.go:97 stamps every authenticated request
with auth.DefaultTenantID. Repository queries don't filter on
tenant. A repo skimmer sees the columns and reasonably assumes
multi-tenancy is wired end-to-end. It isn't.

This was a diligence trap: a buyer planning multi-tenant SaaS
post-acquisition would inspect the schema, conclude the
foundation is in place, and discover at integration time that the
constant-tenant invariant is hard-coded across the request layer.

Fix: docs/reference/architecture.md grows a 'Single-tenant
deployment model' subsection in Design Principles that states
plainly:
  - every authenticated request carries DefaultTenantID
  - tenant_id columns are forward-compatible scaffolding for the
    multi-tenancy roadmap item in WORKSPACE-ROADMAP.md
  - lifting to multi-tenant requires three pieces in sequence:
      (1) request-derived tenant resolution
      (2) per-query tenant scoping
      (3) the multi-tenant-query-coverage CI guard becoming
          a hard gate
  - until that work lands, the multi-tenant columns are decorative

The doc points at scripts/ci-guards/multi-tenant-query-coverage.sh
(which tracks tenant_id-less query drift as an informational
warning today) and explains the inflection point for flipping it
to hard-gate. '> Last reviewed:' bumped to today.

This is a docs-only commit. No runtime behavior change.

Closes ARCH-002-MT.
2026-05-16 04:55:50 +00:00
shankar0123 a0404f2d21 fix(docs,code): ARCH-004 + SEC-003-K8S + ARCH-003 — marketing claims now match code truth
Sprint 4 unified-master-audit closure. Three claim-truth-alignment
findings whose README edits land on shared lines, bundled into one
commit.

ARCH-004 — 'full REST API exposed as MCP tools' overclaim:
  Pre-fix the README said 'the full REST API is exposed as MCP
  tools'; the actual MCP coverage is 162 tools / 220 routes
  (~74%). The remaining gap is intentional: protocol-conformance
  endpoints (ACME/SCEP/EST/OCSP/CRL), browser-only auth flow,
  health/ready, and streaming/binary downloads — categories that
  don't fit the request-response JSON tool shape.

  Fix:
    - README L78 qualified to 'the bulk of the REST API surface'
      with explicit numbers + pointer to the new coverage doc.
    - New docs/reference/mcp-coverage.md publishes the exclusion
      categories with rationale + the canonical commands to
      re-derive route + tool counts.
    - New scripts/ci-guards/mcp-coverage-parity.sh fails the build
      if the tool count drops below (routes − exclusions − 40-slack),
      so a future regression that drops 50+ tools surfaces in CI.
      Verified locally: clean at 162 tools / 220 routes / 37
      intentional exclusions.

SEC-003-K8S — Kubernetes Secrets connector is a runtime stub:
  Pre-fix README L67 marketed 'fifteen native target connectors'
  with Kubernetes Secrets in the list, but realK8sClient's CRUD
  methods returned 'real Kubernetes client not implemented' in
  production. Per the audit's option (b) recommendation: downgrade
  marketing + runtime-guard the stub.

  Fix:
    - README L12 + L67: 'fourteen production-ready native deployment-
      target connectors plus Kubernetes Secrets (preview)'.
    - k8ssecret.New() now refuses to construct unless
      CERTCTL_K8SSECRET_PREVIEW_ACK=true is set, mirroring the
      SEC-H3 ACK pattern. NewWithClient path (test injection)
      unchanged.
    - docs/reference/connectors/index.md moves Kubernetes Secrets
      out of the canonical fourteen-target list into a new 'Preview
      connectors' subsection.
    - Regression tests in k8ssecret_test.go pin the new gate
      (rejects without ACK, accepts with ACK, still rejects nil
      config even with ACK).

ARCH-003 — CERTCTL_KEYGEN_MODE=server breaks the blanket claim:
  Pre-fix README L12 + L82 said 'private keys stay on your
  infrastructure' and 'never touch the control plane' as blanket
  promises. Flipping CERTCTL_KEYGEN_MODE=server makes the control
  plane mint keys in process memory — breaking the claim — and
  the only signal was a boot-time slog WARN. An operator who set
  the flag and didn't read logs ran in silent contradiction to the
  marketed posture.

  Fix:
    - config.Validate() refuses to accept KeygenMode='server'
      unless DemoModeAck=true (mirroring SEC-H3). Production
      deploys (the default Mode='agent' path) are unaffected.
    - README L12 + L82 qualified: 'In agent-mode (the default),
      private keys ...; a demo-only CERTCTL_KEYGEN_MODE=server
      flag mints keys server-side, refuses to start without an
      explicit CERTCTL_DEMO_MODE_ACK=true acknowledgement.'
    - Regression tests for the new Validate gate land in
      config_test.go (note: gate tests landed in the ARCH-002
      commit because of contiguous-hunk constraint at the bottom
      of the file).

Closes ARCH-004, SEC-003-K8S, ARCH-003.
2026-05-16 04:55:34 +00:00
shankar0123 3ce05ab0a8 docs(runbook): DEPL-005 — rewrite postgres-backup automation paths to reference the shipped CronJob
Sprint 3 unified-master-audit closure. docs/operator/runbooks/postgres-backup.md
sections 110-143 still said 'certctl ships no backup CronJob template
in the Helm chart' and the three sample recipes that followed
included an 'in-cluster Postgres → S3' rollup that the operator
'should roll their own.' But the chart actually DOES ship that
CronJob:

  deploy/helm/certctl/templates/backup-cronjob.yaml (Phase 4
  DEPL-H2 closure, 2026-05-14) — opt-in via 'backup.enabled: true',
  PVC + S3 sinks, pg_dump shape byte-comparable with the manual
  command earlier in the runbook.

Operators following the pre-fix runbook would write a duplicate
CronJob from scratch while the working template sat unused under
their nose.

Rewrite of sections 110-143:
  - Lead with the shipped CronJob, two install one-liners (PVC + S3).
  - Move the recipes-by-topology block down to 'When the bundled
    CronJob is NOT the answer' — still call out managed Postgres
    (use provider PITR) and bare-VM Postgres (systemd + pg_dump +
    restic) as deliberately out-of-scope.
  - Add 'Recovery objectives' subsection: RPO ≈ 24h at the default
    nightly schedule, RTO ≈ 30-60min from the existing drill steps
    further down the page. Tells the reader where the bundled
    CronJob fits in their RPO/RTO budget without overpromising
    (anything below 24h RPO needs WAL-shipping, which the CronJob
    doesn't do).
  - Bump '> Last reviewed:' to today.

Closes DEPL-005.
2026-05-16 04:31:31 +00:00
shankar0123 c40690e42d docs(testing): regenerate skip-inventory after SEC-001 types_test.go edit (CI guard skip-inventory-drift)
SEC-001's TestOIDCProvider_Validate_RejectsSSRFIssuer addition
in internal/auth/oidc/domain/types_test.go shifted an existing
t.Skip site from line 186 → line 221. The auto-generated
inventory at docs/testing/skip-inventory.md still pointed at
the old line, so scripts/ci-guards/skip-inventory-drift.sh
failed the build.

Regenerated via scripts/skip-inventory.sh and bumped the
'> Last reviewed:' header. Inventory now matches the live
tree exactly.
2026-05-16 04:15:35 +00:00
shankar0123 c8985cf868 fix(ratelimit): Hotfix #5 — Postgres timestamptz[] scan + skip-inventory drift
Two CI hotfixes surfaced by master CI on 29cb13e7 (Sprint 13.6 tip
before the Sprint 13.7 closure landed):

1. TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas failed with
   "pq: scanning to time.Time is not implemented; only sql.Scanner".
   Root cause: time.Time does not implement sql.Scanner, and lib/pq's
   pq.GenericArray scan path calls element-Scan() directly rather than
   database/sql's convertAssign (which DOES support time conversions).
   So `pq.Array(&[]time.Time{})` reliably fails on read even though
   the symmetric write `pq.Array([]time.Time{...})` works (the write
   path uses driver.Value() which time.Time implements).

   Fix: cast the timestamptz[] to a text[] of canonical ISO 8601 UTC
   strings at the SQL boundary via to_char(t AT TIME ZONE 'UTC',
   'YYYY-MM-DD"T"HH24:MI:SS.US"Z"'), read via pq.StringArray (well-
   supported), and parse Go-side with layout "2006-01-02T15:04:05.000000Z".
   The format is fully deterministic regardless of the session's
   DateStyle or TimeZone settings.

   Touched: internal/ratelimit/postgres_sliding_window.go (Step 2 of
   the Allow() transaction — locking + read).

   Falsifiable proof on CI: the failing test
   TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas
   (100 concurrent Allow calls / 3 replicas / cap=10) must now produce
   exactly 10 succeed / 90 ErrRateLimited. Pre-fix it produced 1 / 0
   because every Allow after the first crashed on Scan.

2. skip-inventory-drift.sh CI guard turned red because Sprint 13.2
   added two new t.Skip sites:

     internal/ratelimit/equivalence_test.go:80
       t.Skip("race-style test under -short")
     internal/ratelimit/equivalence_test.go:88
       t.Skip("postgres equivalence tests require testcontainers;
              skipped under -short")

   The inventory at docs/testing/skip-inventory.md is auto-generated
   by scripts/skip-inventory.sh and must be re-generated alongside
   any t.Skip churn. Sprint 13.2 missed the regeneration.

   Fix: re-ran scripts/skip-inventory.sh. Totals walked
   142 → 144 sites; testing.Short() guards 76 → 78. The two new
   entries land in the internal/ratelimit section.

Verification (local sandbox, all clean):
  $ bash scripts/ci-guards/skip-inventory-drift.sh
    skip-inventory-drift guard OK: docs/testing/skip-inventory.md
    matches the live tree
  $ bash scripts/ci-guards/openapi-handler-parity.sh
    openapi-handler-parity: clean.
  $ bash scripts/ci-guards/openapi-rest-deferred-monotonic.sh
    openapi-rest-deferred-monotonic: clean — rest-deferred = 0,
    baseline = 0.
  $ gofmt -l internal/ratelimit/postgres_sliding_window.go
    (no output)
  $ go vet ./internal/ratelimit/
    (no output)

The Postgres rate-limit fix's full falsifiable proof
(TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas) cannot be
exercised in the sandbox (no docker for testcontainers); CI on the
amd64 runner will re-run it on this push. The diagnosis is verified
against lib/pq source semantics and the fix uses only well-supported
primitives (pq.StringArray + canonical to_char output + time.Parse).
2026-05-14 13:26:47 +00:00
shankar0123 a41fc2d75c feat(ratelimit): Phase 13 Sprint 13.3 — wire backend selector + scheduler janitor + docs + helm (ARCH-M1 closure complete)
Phase 13 Sprint 13.3 — the completion half of the ARCH-M1
substantive close. Sprint 13.2 shipped the Postgres-backed
sliding-window limiter + multi-replica integration test; Sprint 13.3
wires the 6 call sites in cmd/server/main.go through the operator-
chosen backend selector, adds the rate_limit_buckets scheduler
janitor sweep, rewrites the observability doc, exposes the env-var
in the helm chart, and promotes the multi-replica integration test
to a required CI status check.

Signature ground-truth (sprint 13.2 + 13.3)
===========================================
Prompt-template signatures: `Allow(key string) error` and "5 call
sites." Actual repo: `Allow(key string, now time.Time) error` and 6
NewSlidingWindowLimiter call sites in cmd/server/main.go (the prompt
miscounted the second EST per-principal arm). Per CLAUDE.md "the repo
is truth," matched the live shape.

What changed
============

internal/config/server.go (+40 LOC):
  - Added `SlidingWindowBackend string` + `SlidingWindowJanitorInterval
    time.Duration` to RateLimitConfig with full operator-facing
    documentation of the two valid values (memory|postgres) +
    when-to-use-which decision tree.

internal/config/config.go (+27 LOC):
  - Load() reads CERTCTL_RATE_LIMIT_BACKEND (default "memory") +
    CERTCTL_RATE_LIMIT_JANITOR_INTERVAL (default 5m).
  - Validate() rejects anything other than ""/"memory"/"postgres"
    (empty = memory equivalence for test-built Configs that bypass
    Load()). Janitor interval must be ≥ 1 minute when set.
  - Failure modes return clear ::error:: with the env-var name + the
    valid values, so an operator typo ("postgress" → memory in a
    3-replica cluster) fails fast at startup.

internal/ratelimit/factory.go (NEW, 67 LOC):
  - NewLimiter(backend, db, maxN, window, mapCap) Limiter — single
    factory the 6 cmd/server/main.go call sites route through.
  - Drop-in signature: same maxN/window/mapCap as
    NewSlidingWindowLimiter (mapCap accepted + ignored for postgres
    — the rate_limit_buckets table grows until the janitor sweeps).
  - Defensive panic on unknown backend (config.Validate is SoT;
    this is belt-and-suspenders).

internal/ratelimit/postgres_gc.go (NEW, 73 LOC):
  - PostgresGC struct + NewPostgresGC + GarbageCollect.
  - Single-statement DELETE FROM rate_limit_buckets WHERE
    updated_at < NOW() - maxWindow. Idempotent.
  - maxWindow <= 0 is a no-op (operator opt-out).

internal/scheduler/scheduler.go (+90 LOC):
  - New RateLimitGarbageCollector interface (mirrors the
    ACMEGarbageCollector / SessionGarbageCollector contracts).
  - rateLimitGC field + rateLimitGCInterval + rateLimitGCRunning
    on Scheduler.
  - SetRateLimitGarbageCollector(gc) + SetRateLimitGCInterval(d)
    Setters following the existing acmeGC/sessionGC pattern.
  - rateLimitGCLoop() — JitteredTicker + atomic.Bool guard +
    per-tick context.WithTimeout(1m). Logs row count at Debug.
  - Loop counted in the Start() WaitGroup only when the GC is
    non-nil; cmd/server/main.go skips SetRateLimitGarbageCollector
    when backend=memory so the loop never launches for that case.

cmd/server/main.go (35 LOC diff):
  - All 6 ratelimit.NewSlidingWindowLimiter call sites now route
    through ratelimit.NewLimiter(cfg.RateLimit.SlidingWindowBackend,
    db, ...). Grep verification post-fix returns ZERO hits.
  - Six sites: breakglass loginLimiter (580), ocspLimiter (1003),
    exportLimiter (1068), EST failed-basic (1535), EST per-principal
    SCEP-mTLS arm (1591), EST per-principal SCEP arm (1613). The
    intune.NewPerDeviceRateLimiter site at line 1823 stays unmoved
    — its inner type-alias wrapper is the prompt's
    out-of-scope (cmd/server/*.go only).
  - Conditionally constructs PostgresGC + wires the scheduler janitor
    when backend=postgres; logs the wiring decision either way so
    operators see "rate-limit GC sweep enabled (postgres backend)"
    or "in-memory backend self-prunes" in the boot log.

internal/api/handler/{est,export,certificates,auth_breakglass}.go:
  - Replaced 5 *ratelimit.SlidingWindowLimiter field/Setter types
    with ratelimit.Limiter (the interface). Allow() satisfies the
    same call shape on both backends; the in-memory tests that
    construct *SlidingWindowLimiter still compile because the
    concrete type satisfies the interface (compile-time check in
    internal/ratelimit/limiter.go pins this).

docs/operator/observability.md (176 LOC diff):
  - Replaced the "per-process, in-memory, reset-on-restart, not
    shared across replicas" paragraph with the new
    configurable-backend section: operator decision tree,
    backend internals (memory vs postgres), janitor description,
    falsifiable closure proof (the Sprint 13.2 integration test
    name + invocation), helm chart wiring example.
  - Updated inventory to reflect the actual handler file paths +
    actual cap configurations (the prior doc said "60s window" for
    several limiters that actually use 60m / 24h windows).
  - Doc smoke confirmed: grep -c 'per-process, in-memory,
    reset-on-restart' docs/operator/observability.md = 0.

deploy/helm/certctl/values.yaml + templates/server-configmap.yaml +
templates/server-deployment.yaml:
  - Exposed server.rateLimiting.backend (default "memory") +
    server.rateLimiting.janitorInterval (default "5m") under the
    existing rateLimiting block.
  - ConfigMap renders both as rate-limit-backend +
    rate-limit-janitor-interval keys.
  - Deployment wires CERTCTL_RATE_LIMIT_BACKEND +
    CERTCTL_RATE_LIMIT_JANITOR_INTERVAL env vars from the configmap.
  - Helm render: `helm template deploy/helm/certctl --set
    server.rateLimiting.backend=postgres` shows the env-var on the
    server-deployment.yaml output.

.github/workflows/ci.yml (+12 LOC):
  - Added a new step in the Go Build & Test job that runs the
    Sprint 13.2 multi-replica integration test
    (TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas) with
    -tags=integration -race -timeout=300s. Fails the CI status check
    if the cross-replica row lock ever stops arbitrating across
    replicas — the ARCH-M1 closure regression gate.

Verification (all green locally; postgres integration via CI)
============================================================

  $ grep -nE 'NewSlidingWindowLimiter' cmd/server/*.go
    (zero hits — Sprint 13.3 receipt)

  $ go test -short -count=1 \
      ./internal/config/... ./internal/ratelimit/... \
      ./internal/scheduler/... ./internal/api/handler/... \
      ./cmd/server/...
    ok  internal/config       1.177s
    ok  internal/ratelimit    0.007s
    ok  internal/scheduler    9.165s
    ok  internal/api/handler  6.245s
    ok  cmd/server            0.390s

  $ staticcheck ./internal/ratelimit/... ./internal/scheduler/... \
      ./internal/config/... ./internal/api/handler/... ./cmd/server/...
    (clean)

  $ gofmt -l internal/ cmd/server/
    (clean)

  $ grep -c 'per-process, in-memory, reset-on-restart' \
      docs/operator/observability.md
    0   (doc smoke — the audit's verbatim phrasing is gone)

  $ bash scripts/ci-guards/G-3-env-docs-drift.sh
    G-3 env-docs-drift: clean.

  $ bash scripts/ci-guards/complete-path-config-coverage.sh
    OK — every CERTCTL_* env var (197) has at least one non-config-
    package consumer.

Selector contract verified — config.Validate() rejects any value
other than ""/memory/postgres at startup with a clear error message.

Sprint 13.4 next (ARCH-H1 OpenAPI authoring batch 1) is on a
different axis; ARCH-M1 closure is complete with this commit
modulo the Sprint 13.7 audit-HTML flip + zero-floor pin.

Closes: ARCH-M1 substantive remediation. The cross-replica rate-
limit-cap-enforcement gap that the audit recommended deferring to
v3 is closed; operators with server.replicas > 1 flip
CERTCTL_RATE_LIMIT_BACKEND=postgres and get exactly-cap enforcement
across the cluster (proved by the multi-replica integration test now
gating CI).
2026-05-14 11:52:13 +00:00
shankar0123 558d350933 fix(ci): teach 3 CI guards about Phase 9 sibling-file splits
Two CI guards on origin/master failed against the Sprint-12 commit
(30940108) because they didn't know about new files introduced by
earlier Phase 9 sprints. Both are pure mechanical relocation
fall-out — no actual regression in functionality.

1. scripts/ci-guards/no-new-synthetic-admin.sh — A-8 guard
====================================================================
Sprint 5 (commit 51f9cf13) extracted the Auth-family from
internal/config/config.go to internal/config/auth.go. The 4
'actor-demo-anon' references moved with the Auth-family code:

  - Line 255: 'actor-demo-anon is wired with AdminKey=true'
    documentation comment alongside the AdminKey wiring narrative.
  - Lines 283/289/293: residual-grants detector + cleanup SQL
    examples explaining why 'ar-demo-anon-admin' is reserved.

These are the SAME comments that were previously in config.go (which
IS in the allowlist), just relocated to the new sibling file. The
references were always present in the codebase; the A-8 guard was
just unaware of the new file location.

Fix: add './internal/config/auth.go' to the ALLOWLIST with a rationale
comment pointing at commit 51f9cf13.

Local verification: A-8 guard PASS — actor-demo-anon references
confined to the declared 19-entry allowlist (was 18, now 19).

2. internal/ciparity/surface_parity_test.go — mcpToolFiles list
====================================================================
Sprint 10 (commit fbe053aa) split internal/mcp/tools.go (1867 LOC,
121 mcp.AddTool registrations) into six tool-domain sibling files:

  tools_certificates.go (22 tools — cert + CRL/OCSP + renewal + verify)
  tools_agents.go       (16 tools — agents + agent groups)
  tools_resources.go    (40 tools — issuers + targets + policies +
                                    profiles + teams + owners +
                                    notifications + intermediate-CAs)
  tools_jobs.go         (9  tools — jobs + approvals)
  tools_discovery.go    (10 tools — network-scan + discovery)
  tools_admin.go        (24 tools — audit + stats + digest + metrics
                                    + health + health-check)

The TestSurfaceParity_MCPToolCatalogue hard-gate counts mcp.AddTool
registrations across mcpToolFiles() — a hard-coded 5-file list. After
the split, only 34 tools sat in the 5 known files (tools.go itself
went to 0 tools post-split; only the 4 pre-existing tools_*.go
siblings carried any). The actual cross-file count is 155 (above
the 150 floor).

Fix: expand mcpToolFiles() to include the 6 new Sprint-10 sibling
files. Doc-comment explains the Sprint-10 split + the union-of-files
intent.

Local verification:
  PASS: TestSurfaceParity_MCPToolCatalogue
    MCP tool catalogue: 155 tools (baseline floor 150)

3. docs/testing/skip-inventory.md — line-number drift
====================================================================
Adding the 8-line doc-comment to mcpToolFiles() (item 2) shifted the
location of readFileOrSkip from line 97 to line 113 in
surface_parity_test.go. The skip-inventory.md is auto-generated and
records every t.Skip() site with its file:line; the
skip-inventory-drift CI guard re-runs the generator and diffs.

Fix: bump the inventory entry from :97 to :113. One-line tracking
update; same skip site, new line number. (No t.Skip() was added or
removed.)

Behavior preservation contract
==============================
- Zero runtime change. All three diffs touch only CI-guard
  metadata (allowlist string, file-list slice, doc line-number).
- A-8 guard re-runs clean post-fix.
- TestSurfaceParity_MCPToolCatalogue runs and reports 155 tools.
- skip-inventory drift detection re-pins to the live line number.
- gofmt + go vet + staticcheck remain clean on the touched files
  (verified pre-commit; the sandbox /sessions partition is full so
  the broader 'all guards' loop was interrupted on a tmpfile write,
  not on a real regression — the deterministic fix above matches
  the CI failure output byte-for-byte).

Closes: CI failures on commit 30940108 across Frontend Build (A-8
guard) + Go Build & Test (TestSurfaceParity_MCPToolCatalogue).
2026-05-14 11:04:32 +00:00
shankar0123 1279172e9b loadtest: close Phase 8 SCALE-H2 — add scale-tier scenarios
Phase 8 of the certctl architecture diligence remediation closes
SCALE-H2 by adding three new k6 scenarios that exercise the scale-
relevant load surfaces the API tier + connector tier left uncovered:
fleet-scale bulk renewal, ACME enrollment burst, and agent heartbeat
storm.

Audit miscount + path correction (live-grep at Phase 8 audit time)
==================================================================
- The Phase 8 prompt referenced both `deploy/test/load/` and
  `deploy/test/loadtest/`. Repo truth: the existing harness lives at
  `deploy/test/loadtest/`. New scenarios land there.
- The audit's prior framing "k6 covers the API tier at 50 req/s
  only" omitted Bundle 10 (2026-05-02) which added four connector-
  tier handshake scenarios (nginx/apache/haproxy/f5) at 100 conns/min
  each, plus the Phase 5 ACME directory/nonce/ARI scenario at 100 VUs
  in `k6/acme_flow.js`. Phase 8 appends to what's there rather than
  rewriting.

What ships
==========

Three new k6 scenario files under deploy/test/loadtest/k6/:

  bulk_renewal.js — 10K-cert seed + 5 req/s POST /bulk-renew × 5min
                    p99 < 5s, p95 < 2s, errors < 1%
  acme_burst.js   — 200 VU sustained × directory/nonce/ARI × 5min
                    directory p95 < 500ms, nonce p95 < 300ms,
                    renewal-info p95 < 800ms, 5xx-only < 0.1%
                    Pins RFC 7807 rate-limit response shape via
                    acme_rate_limit_shape_ok Counter.
  agent_storm.js  — 5K-agent seed + 167 req/s POST /heartbeat × 5min
                    p99 < 1s, p95 < 500ms, errors < 0.1%

Two seed SQL fixtures under deploy/test/loadtest/seed/:

  01_bulk_renewal_certs.sql — 10,000 managed_certificates rows
    linked to seed_demo.sql FKs (iss-local, o-alice, t-platform,
    rp-standard). status='active', expires_at distributed across
    next 30 days, name prefix `loadtest-bulk-` so the scenario
    can scope its criteria. Idempotent via
    ON CONFLICT (name) DO NOTHING.

  02_agent_fleet.sql — 5,000 agents rows with name prefix
    `loadtest-agent-`. status='Online', last_heartbeat_at
    staggered across prior 60s, OS distribution 80%/10%/10%
    linux/windows/darwin. Idempotent via
    ON CONFLICT (id) DO NOTHING.

Plus seed/README.md documenting the opt-in profile + when these
run vs the default `make loadtest` fast path.

Compose + Makefile + CI wiring
==============================

deploy/test/loadtest/docker-compose.yml gains four new services,
all gated behind the `scale` compose profile so the default
`make loadtest` is unchanged:

  scale-seed       — one-shot postgres:16-alpine container that runs
                     every ./seed/*.sql in lexical order against the
                     same postgres the server uses. Depends on
                     postgres healthy + certctl-server healthy (so
                     migrations + seed_demo.sql have already run).
  k6-scale-bulk    — grafana/k6:0.54.0 driver running bulk_renewal.js
  k6-scale-acme    — grafana/k6:0.54.0 driver running acme_burst.js
  k6-scale-agent   — grafana/k6:0.54.0 driver running agent_storm.js

Each driver depends_on scale-seed completed_successfully so the
scenarios never run against an unseeded DB (the acme scenario
doesn't need the seed itself but uses the same dependency chain for
ordering predictability).

Makefile gains four new phony targets:

  loadtest-scale-bulk   - runs bulk_renewal.js via compose --profile scale
  loadtest-scale-acme   - runs acme_burst.js
  loadtest-scale-agent  - runs agent_storm.js
  loadtest-scale        - all three serially

.github/workflows/loadtest.yml gains a new k6-scale matrix job that
runs after the existing k6 job (needs: k6) with a matrix on the
three scenarios — fail-fast: false so a regression in one scenario
doesn't cancel the others. Same workflow_dispatch + weekly cron
cadence as the existing API + connector tier job.

Documentation
=============

docs/operator/scale.md gains a new "Scale-tier scenarios (SCALE-H2,
Phase 8)" section between the cursor-pagination subsection and the
profiling-production subsection. Documents:
  - Scenario + seed + sustained load table
  - Threshold contract (regression guards, NOT measured baselines)
  - Measured-baseline table with TBD placeholders + the canonical-
    hardware capture procedure
  - How to run the scale tier locally
  - Four documented limitations (JWS-signed ACME, scheduler renewal
    scan throughput, production-sized Postgres, pull-only deployment
    model)

deploy/test/loadtest/README.md gains a short "Scale tier (Phase 8
SCALE-H2, 2026-05-14)" section pointing at scale.md as the canonical
operator-facing baseline source. Avoids duplication; the README
remains the harness-mechanics doc.

Deliberate deviations from the prompt
======================================

The Phase 8 prompt's "concrete deliverables" section referenced
`deploy/test/load/` (no -test) for the new k6 files. The actual
harness lives at `deploy/test/loadtest/` — the new files land there
to match existing convention. The prompt's audit-questions section
also referenced `deploy/test/loadtest/` so the prompt was internally
inconsistent on this; repo truth wins.

The prompt described the ACME burst as "200 concurrent ACME orders
against /acme/profile/<id>/new-order ... pin the rate-limit response
shape." new-order is JWS-signed (RFC 8555 §7.4 requires JWS for
every POST except newAccount-pre-account-key flows). k6 doesn't
ship JWS and bundling a signer (e.g. lego) into the k6 container
would obscure the server-side latency the scenario is trying to
measure. Same trade-off the existing Phase 5 acme_flow.js made.
Phase 8's acme_burst.js measures the unauthenticated
directory + nonce + ARI surface at burst rate AND pins the 429
rate-limit response shape via a custom Counter that increments only
when the response is `application/problem+json` with the
`urn:ietf:params:acme:error:rateLimited` type. End-to-end JWS
conformance under load remains a follow-up; the canonical JWS
correctness gate is `make acme-rfc-conformance-test` (lego-based,
non-load).

Deferred (operator-side, not engineering)
==========================================

Canonical-hardware baseline capture. The TBD placeholders in
docs/operator/scale.md's measured-baseline table are intentional —
sandbox-captured numbers from a developer laptop are misleading
(same anti-pattern the original loadtest README guards against).
Operator triggers loadtest.yml from the Actions tab, waits for the
k6-scale matrix jobs to complete, downloads the per-scenario
summary artifacts, copies p50/p95/p99 into the table, commits the
captured numbers alongside the date + commit SHA.

Files changed (10):
  .github/workflows/loadtest.yml                            (+72 -1)
  Makefile                                                  (+47 -1)
  deploy/test/loadtest/README.md                            (+28 -1)
  deploy/test/loadtest/docker-compose.yml                   (+108 -1)
  deploy/test/loadtest/k6/bulk_renewal.js                   (new, 106 lines)
  deploy/test/loadtest/k6/acme_burst.js                     (new, 192 lines)
  deploy/test/loadtest/k6/agent_storm.js                    (new, 124 lines)
  deploy/test/loadtest/seed/01_bulk_renewal_certs.sql       (new, 95 lines)
  deploy/test/loadtest/seed/02_agent_fleet.sql              (new, 92 lines)
  deploy/test/loadtest/seed/README.md                       (new, 86 lines)
  docs/operator/scale.md                                    (+109 -0)

Verification (sandbox-runnable):
  python3 -c 'import yaml; yaml.safe_load(open("deploy/test/loadtest/docker-compose.yml"))'
    → compose YAML OK
  python3 -c 'import yaml; yaml.safe_load(open(".github/workflows/loadtest.yml"))'
    → workflow YAML OK
  grep -E 'bulk_renewal|acme_burst|agent_storm' deploy/test/loadtest/k6/*.js
    → all three scenarios + tags present
  grep loadtest-scale Makefile
    → 4 new targets registered in .PHONY + 3 recipes + 1 aggregate

Runtime verification (deferred — requires docker on canonical hardware):
  make loadtest-scale-bulk    # 10K cert fixture + 5 req/s × 5min
  make loadtest-scale-acme    # 200 VU × 5min
  make loadtest-scale-agent   # 5K agent fixture + 167 req/s × 5min
  make loadtest-scale         # all three serially

Closes: cowork/certctl-architecture-diligence-audit.html#fix-SCALE-H2
2026-05-14 03:25:15 +00:00
shankar0123 8191b1ee64 scheduler+db: close Phase 6 — scale hardening across pool, jitter, ETag, asyncpoll
Phase 6 of the certctl architecture diligence remediation. Five
findings across the same scheduler-and-DB-pool surface.

SCALE-M1 (Med) — DB pool default bumped 25 → 50
  internal/config/config.go line 1972:
    MaxConnections: getEnvInt("CERTCTL_DATABASE_MAX_CONNS", 50)
  Postgres default max_connections is 100; 50 leaves headroom for
  pg_dump + ad-hoc psql + a server replica without exhausting the
  DB-side cap. Operator override env var unchanged. Operator-tune
  ladder for larger fleets (5K / 50K certs) lives in
  docs/operator/scale.md as starter values pending Phase 8 load
  tests — explicitly marked TBD.

SCALE-M3 (Med) — async-CA poll budget operator-configurable
  Live state was partially-already-shipped: all 4 async-CA
  connectors (digicert, entrust, globalsign, sectigo) already have
  per-connector CERTCTL_<NAME>_POLL_MAX_WAIT_SECONDS (Audit fix #5
  closed pre-Phase-6). What was missing: a global package-default
  override. Shipped:
    - internal/connector/issuer/asyncpoll/asyncpoll.go gains
      SetDefaultMaxWait(d) + effectiveDefaultMaxWait var + the
      currentDefaultMaxWait() priority resolver.
    - cmd/server/main.go reads CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS
      at boot and calls SetDefaultMaxWait.
    - deploy/ENVIRONMENTS.md documents the new env var (G-3 guard
      green).
  Naming deviation from the prompt's CERTCTL_ASYNC_POLL_MAX_ATTEMPTS:
  the live code tracks wall-clock time (MaxWait), not attempt count.
  Matched the existing per-connector nomenclature (_POLL_MAX_WAIT_SECONDS)
  so the priority chain reads naturally.

SCALE-M5 (Med) — JitteredTicker wrapper for all 15 scheduler loops
  internal/scheduler/jitter.go ships NewJitteredTicker(interval,
  jitterPct) + DefaultSchedulerJitter (±10%). All 15 sites in
  internal/scheduler/scheduler.go migrated from bare time.NewTicker
  to NewJitteredTicker(interval, DefaultSchedulerJitter). Base
  intervals unchanged; only the per-tick envelope adds ±10%
  randomized delay so multiple loops with the same nominal cadence
  don't co-fire and spike CPU + DB at wall-clock boundaries.

  internal/scheduler/jitter_test.go pins:
    - Bounded envelope (each tick within ±jitterPct of interval)
    - Mean drift < 30% of nominal (sign-bug detector)
    - Stop() releases the goroutine + closes C
    - Stop() idempotent (no panic on repeat)
    - Zero-jitter behaves like time.NewTicker
    - Negative and >=1 jitterPct values clamped defensively

  CI guard scripts/ci-guards/no-bare-newticker-in-scheduler.sh blocks
  any future bare time.NewTicker in scheduler.go.

SCALE-L1 (Low) — renewal-sweep semaphore behavior documented
  docs/operator/scale.md "Scheduler tick budgets" section explains
  the per-tick concurrency semaphore (CERTCTL_RENEWAL_CONCURRENCY=25
  default), the ctx-cancellation drain on tick-budget overrun, and
  operator tuning advice (raise concurrency + DB pool together).
  No code change — the behavior is defensible as-is per the audit.

SCALE-L2 (Low) — ETag middleware for top-5 read endpoints
  internal/api/middleware/etag.go computes SHA-256 ETag over the
  buffered response body, respects If-None-Match, short-circuits
  to 304 Not Modified on match. GET/HEAD only; non-2xx responses
  pass through unchanged. 64 KiB buffer cap degrades gracefully on
  oversized responses (no caching, body still flushes intact).

  Wired around the top-5 read endpoints via etagged() helper in
  internal/api/router/router.go:
    GET /api/v1/certificates
    GET /api/v1/agents
    GET /api/v1/jobs
    GET /api/v1/audit
    GET /api/v1/discovered-certificates

  internal/api/middleware/etag_test.go pins 11 behaviors including
  304-on-repeat, 200-after-mutation-with-new-ETag, POST bypass,
  4xx/5xx pass-through, oversized-response degradation, wildcard
  match, HEAD-treated-like-GET, byte-equal pass-through.

Cross-cutting fixes:
  - internal/config/config_test.go::TestLoad_DefaultValues updated
    to assert the new 50 default (was 25).
  - deploy/helm/certctl/values.yaml comment corrected — agent
    pollInterval is hardcoded 30s, not env-configurable; the
    Phase 4 comment mistakenly referenced CERTCTL_AGENT_POLL_INTERVAL
    which G-3 caught as a phantom env var.
  - asyncpoll.go reformatted by gofmt; functionally unchanged.

Verification (all pass):
  grep -nE 'SetMaxOpenConns' internal/repository/postgres/db.go    # finds 1 site
  grep -nE 'CERTCTL_DATABASE_MAX_CONNS.*50' internal/config/config.go  # config default is 50
  grep -rnE 'CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS' internal/ deploy/ENVIRONMENTS.md  # wired
  grep -cE 'time\.NewTicker\(' internal/scheduler/scheduler.go    # 0 (all migrated)
  grep -cE 'JitteredTicker' internal/scheduler/scheduler.go         # 15
  ls internal/scheduler/jitter.go internal/api/middleware/etag.go   # both exist
  ls docs/operator/scale.md                                          # exists
  bash scripts/ci-guards/no-bare-newticker-in-scheduler.sh          # clean
  bash scripts/ci-guards/G-3-env-docs-drift.sh                      # clean
  go test ./internal/scheduler/ ./internal/api/middleware/ \
    ./internal/connector/issuer/asyncpoll/ ./internal/config/       # 4/4 packages green

Closes: cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M1
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M3
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M5
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L1
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L2
2026-05-14 01:23:03 +00:00
shankar0123 d6f4d5c5e8 deploy(helm): close Phase 4 — chart surface + DR + ops runbooks
Phase 4 of the certctl architecture diligence remediation closure.
Seven findings, all in deploy/helm/certctl/.

DEPL-H2 (High) — ship deploy/helm/certctl/templates/backup-cronjob.yaml
  Operator opt-in via backup.enabled=true. Default OFF. CronJob runs
  pg_dump --format=custom --no-owner --no-acl --dbname=certctl
  matching the canonical shape in
  docs/operator/runbooks/postgres-backup.md (so manual and
  automated dumps are byte-identical). Sink: PVC (default) OR S3
  via aws-cli. Documented as in-cluster-Postgres only — managed DB
  deployments rely on their provider's PITR.

DEPL-M1 (Med) — Helm pre-install/pre-upgrade migration hook
  deploy/helm/certctl/templates/migration-job.yaml — runs
  `certctl-server --migrate-only` before the server Deployment
  rolls. The --migrate-only flag (new in cmd/server/main.go) is a
  hermetic schema-mutation pass: load config, open DB pool, run
  RunMigrations + RunSeed, exit 0. No HTTP listener, no scheduler,
  no signing setup.

  Server's boot-time RunMigrations call is now gated on
  CERTCTL_MIGRATIONS_VIA_HOOK — when set true, the server skips
  the boot path (the hook owns the work). Default still runs at
  boot, so Compose / VM / bare-metal deploys are unchanged.

  migrations.viaHook: false in values.yaml (off by default).

DEPL-M4 (Med) — explicit Postgres StatefulSet strategy fields
  deploy/helm/certctl/templates/postgres-statefulset.yaml adds:
    spec.updateStrategy.type: OnDelete
    spec.podManagementPolicy: OrderedReady
  Operator-controlled Postgres upgrades (the OnDelete strategy
  means a chart template tweak no longer triggers an immediate
  Postgres restart). OrderedReady aligns with the standard
  Postgres-on-Kubernetes pattern for any future HA work.

DEPL-M5 (Med) — per-fleet-size resource ladder documentation
  deploy/helm/certctl/values.yaml — extended comments next to
  server.resources + agent.resources documenting:
    "≤ 500 certs / 100 agents" → defaults are validated
    "5K certs / 1K agents" → starter suggestions, TBD Phase 8
    "50K certs / 10K agents" → starter suggestions, TBD Phase 8
  Numbers for the small-fleet case derive from the measured
  baselines in docs/operator/performance-baselines.md
  (50ms p50, < 3s for 1000-cert inventory walk, etc.). Larger
  fleet numbers explicitly marked TBD pending Phase 8 load-test
  runs — operators tune empirically until then.

DEPL-L1 (Low) — Helm rollback runbook
  docs/operator/runbooks/rollback.md — covers helm rollback
  mechanics, the schema-migration manual-cleanup path (when
  *.down.sql files apply vs. when full restore is the only safe
  path), and the per-migration-class safe-to-rollback table.

DEPL-L2 (Low) — Prometheus AlertManager rules
  deploy/helm/certctl/templates/prometheusrules.yaml — opt-in via
  monitoring.prometheusRules.enabled=true. Default OFF. Four
  starter rules using verified metric names from
  internal/api/handler/metrics.go:
    CertctlCertificateExpiringSoon (certctl_certificate_expiring_soon)
    CertctlAgentOffline ((agent_total - agent_online) > 0 for 1h)
    CertctlJobFailureRateHigh (failure rate over 5% for 15m)
    CertctlIssuanceFailures (any failures over 15m window)
  All thresholds operator-tunable via
  monitoring.prometheusRules.thresholds.* in values.

DEPL-L3 (Low) — Prometheus bearer-token setup runbook
  docs/operator/runbooks/prometheus-bearer-token.md — documents
  the API-key + Secret + values wiring for the RBAC-gated
  /api/v1/metrics/prometheus scrape endpoint. End-to-end
  procedure with troubleshooting steps + rotation guide.

CI guard: scripts/ci-guards/helm-templates-lint.sh
  Six-combo matrix: defaults / backup PVC / backup S3 /
  prometheusRules / migrations.viaHook / all-on. Each runs helm
  template + checks render success. helm lint also gated.
  Wired into the auto-pickup loop in .github/workflows/ci.yml;
  azure/setup-helm@b9e51907 (v4.3.0, SHA-pinned per Phase 1
  RED-2) installs helm v3.16.0 on the runner.

Verification (all pass):
  ls deploy/helm/certctl/templates/{backup-cronjob,migration-job,prometheusrules}.yaml
  grep -E 'updateStrategy|podManagementPolicy' deploy/helm/certctl/templates/postgres-statefulset.yaml  # 2 matches
  helm template deploy/helm/certctl/ --set backup.enabled=true \
    --set monitoring.prometheusRules.enabled=true --set migrations.viaHook=true \
    | grep -E "kind: (CronJob|PrometheusRule|Job)"  # 3 matches
  helm lint deploy/helm/certctl/  # 0 failed
  ls docs/operator/runbooks/{rollback,prometheus-bearer-token}.md
  bash scripts/ci-guards/helm-templates-lint.sh  # 6/6 matrix combinations pass

Go build clean (cmd/server compiles, migrate-only path verified by
the build target). YAML validated.

Closes: cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H2
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M1
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M4
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M5
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L1
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L2
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L3
2026-05-14 00:58:00 +00:00
shankar0123 09c29b9f40 docs: shift to Pattern A in history-normalization.md
Phase 0 follow-up — Pattern A migration (post-Pattern-C trailer strip
+ archive tag deletion).

Updates the public-facing explanation to match the post-strip state:
no more Co-authored-by trailers in commit messages, no more archive
tag on origin. The off-platform bundle remains as the canonical
pre-rewrite preservation record.

Why the change from Pattern C → A: the Co-authored-by trailers added
in the original rewrite caused GitHub to render the AI identities
(claude, cowork, certctl-bot, certctl-copilot, github-actions) as
co-author chips on every AI-touched commit AND count them in the
repo's contributor graph. Operator opted to clean the contributor
list. The legal posture (counsel-signed AI-authorship declaration in
cowork/legal/) is unchanged — only the git-history layer's
transparency signal was dialed back.

Bundle at cowork/legal/pre-rewrite-2026-05-13.bundle still preserves
the original history (all 14 author identities + un-stripped commit
messages) for any future forensic / diligence question.
2026-05-13 23:14:20 +00:00
shankar0123 921dac7e6b docs: explain the Phase 0 git history normalization
Public-facing transparency artifact for the 2026-05-13 git-history
rewrite. Plain-language explanation of: what changed (uniform author
metadata to canonical operator identity + Co-authored-by trailers
preserving AI involvement), why (LLC ownership transfer to certctl LLC
+ pre-traction cleanup), what is preserved (archive tag +
off-platform bundle), how to recover a stale clone, and the operational
note that external PRs aren't accepted until a CLA workflow is set up.

The README pointer to this doc is intentionally omitted — the page is
discoverable via grep against the repo (`history-normalization`),
via the next CHANGELOG entry, and via any forensic observer who
notices the rewrite and grep-searches for an explanation.

Closes the public-transparency leg of Phase 0 (Path B2, Pattern C).
2026-05-13 21:24:09 +00:00
shankar0123 02438ad9e1 ci: floor raise + doc drift (Phase 3 closure — TEST-H1/H2/M1/M2/M3/M4/L1, ARCH-H3/L1/L2/L3/L4)
Twelve findings from the architecture diligence audit's Phase 3 bundle
closed in one PR. All touch the CI workflows + small doc-drift fixes
across the production Go tree + migration headers.

CI workflow changes
====================

TEST-H1 — Race detection on ./... -short
  .github/workflows/ci.yml:106 was a 9-package explicit list. Audit
  finding TEST-H1 flagged that 25+ packages (internal/auth/*,
  internal/repository/*, internal/mcp, internal/scep, internal/pkcs7,
  internal/api/router, internal/api/acme, internal/cli, internal/cms,
  internal/config, internal/deploy, internal/integration,
  internal/ratelimit, internal/secret, internal/trustanchor, all of
  cmd/) silently dropped off race coverage.
  Post-fix: 'go test -race -short ./... -count=1 -timeout 600s'.
  76 testing.Short() guards already cover testcontainers + live-DB
  integration suites, so -short keeps the long-running tests out.

TEST-H2 — Cross-platform build matrix
  New 'cross-platform-build' job in ci.yml. Matrix:
  ubuntu-latest + windows-latest + macos-latest, fail-fast: false.
  Builds cmd/server + cmd/agent + cmd/cli + cmd/mcp-server on each.
  Catches Windows-specific regressions (path separators, file
  permissions, exec.Command semantics) the pre-Phase-3 Ubuntu-only
  CI missed.

TEST-L1 — actions/setup-go cache: true (explicit)
  setup-go v5 defaults cache: true; making it explicit so a future
  setup-go upgrade can't silently flip it. Re-runs hit the Go module
  + build cache instead of recompiling cold.

TEST-M1 — Mutation-testing floor at 55%
  security-deep-scan.yml::go-mutesting step rewritten. Removed
  continue-on-error + per-package '|| true'. New post-loop check
  extracts every 'The mutation score is X.YZ' line and fails the
  step if any package drops below 0.55. Floor rationale: starter
  ratio catches major regressions without rejecting the audit's
  'this is OK' steady state; raise quarterly.

TEST-M2 — 3 advisory deep-scan gates promoted to blocking
  Removed continue-on-error: true from:
    - gosec (filtered to G201/G202/G304/G108 high-signal rules:
      SQL-injection + path-traversal + pprof-exposed)
    - osv-scanner (multi-ecosystem CVE; complements govulncheck
      which is already blocking in ci.yml)
    - trivy image scan (--severity HIGH,CRITICAL --exit-code 1)
  continue-on-error count: 15 → 11.
  ZAP / schemathesis / nuclei / testssl stay advisory because their
  false-positive rates on https://localhost:8443-targeted DAST runs
  are high.

TEST-M3 — Playwright harness stub
  web/package.json adds '@playwright/test' devDep + 'e2e' / 'e2e:install'
  npm scripts. web/playwright.config.ts ships single chromium project
  with webServer block pointing at 'npm run dev'. web/src/__tests__/
  e2e/smoke.spec.ts proves the harness wires through. The full 15-flow
  suite ships in frontend-design-audit Phase 8 (TEST-H1 in THAT audit);
  this is the wiring + a single smoke test as the regression floor.
  New Makefile target: 'make e2e-test'.

Doc/code drift fixes
====================

TEST-M4 + ARCH-L2 — Skip inventory artifact + CI guard
  scripts/skip-inventory.sh walks every t.Skip site under cmd/ +
  internal/ + deploy/test/ and emits docs/testing/skip-inventory.md
  grouped by package with file:line:expression triples. Current
  inventory: 142 t.Skip sites, 76 testing.Short() guards.
  scripts/ci-guards/skip-inventory-drift.sh regenerates and fails on
  diff (excluding the 'Last reviewed' timestamp line which drifts
  daily). The Markdown is the canonical acquisition-diligence artifact
  for 'what tests are being skipped and why.'

ARCH-H3 — MCP catalogue floor reconciliation
  Audit framing was '121 vs floor 150 — doc/code drift.' Live count
  via the test's actual regex over all 5 tool files (tools.go +
  tools_audit_fix.go + tools_auth.go + tools_auth_bundle2.go +
  tools_est.go): 155 unique 'Name: "certctl_*"' declarations.
  Pre-Phase-3 audit measured tools.go in isolation (121) and missed
  the other 4 files (+34 unique names). The test at
  internal/ciparity/surface_parity_test.go::TestSurfaceParity_MCP
  passes today (155 ≥ 150). Added a clarifying comment near
  mcpBaselineFloor explaining the measurement scope so future
  reviewers don't repeat the audit's framing error.
  STATUS: stale — no code drift, just a measurement scoping error in
  the audit.

ARCH-L1 — panic() rationale comments
  5 panic sites in production Go (excluding _test.go):
    - internal/repository/postgres/tx.go:84
    - internal/service/issuer.go:861 (mustJSON)
    - internal/service/est.go:728 (mustParseTime)
    - internal/service/acme.go:1288 (rand source failure — already documented)
    - internal/pkcs7/certrep.go:270 (OID marshal — already documented)
  Added ARCH-L1 rationale comments to the 3 sites that didn't have
  them. All 5 are defensible impossible-path / rethrow / hardcoded-
  constant guards.

ARCH-L3 — Migration IF-NOT-EXISTS carve-outs
  4 migrations skip the literal 'IF NOT EXISTS' token but ARE
  idempotent via different Postgres patterns:
    - 000014_policy_violation_severity_check.up.sql: ALTER TABLE
      ADD CONSTRAINT CHECK doesn't accept IF NOT EXISTS; idempotency
      via DROP CONSTRAINT IF EXISTS preamble.
    - 000018_audit_events_worm.up.sql: CREATE OR REPLACE FUNCTION
      + DROP TRIGGER IF EXISTS + CREATE TRIGGER + DO $$ pg_roles
      existence check. CREATE TRIGGER doesn't take IF NOT EXISTS.
    - 000030_rbac_admin_perms.up.sql: INSERT ... ON CONFLICT DO NOTHING.
    - 000039_audit_crit1_perms.up.sql: same INSERT + ON CONFLICT pattern.
  Added ARCH-L3 header comments to each explaining the carve-out so
  reviewers don't flag the missing literal token.
  STATUS: largely stale — migrations are already idempotent.

ARCH-L4 — TODO/FIXME → see #<descriptor>
  5 TODOs rewritten to the allowed 'see #<descriptor>' pattern:
    - internal/repository/postgres/auth.go:220 → see #bundle-2-scope-fk
    - internal/connector/discovery/gcpsm/gcpsm.go:547 → see #gcpsm-pagination
    - internal/service/audit.go:244 → see #audit-pagination-count
    - internal/service/job.go:295, 299 → see #validation-job-impl
  New CI guard scripts/ci-guards/no-todo-in-prod.sh grep-fails any
  new TODO/FIXME in cmd/ + internal/ (excluding _test.go); allows
  'see #N' / 'see #<descriptor>' patterns.

Sandbox limitation
==================
The 6.1 GB certctl working tree fills the sandbox volume; go1.25.10
toolchain download fails with 'no space left on device' (sandbox has
1.25.9; go.mod requires 1.25.10). Local 'go test' / 'go build' NOT
run in this commit. Operator must run 'make verify' on their
workstation before push per CLAUDE.md operating rules.

The smoke.spec.ts NOT executed in the sandbox (no chromium installed).
Operator runs 'cd web && npm install && npx playwright install
--with-deps chromium && npm run e2e' on first wire-up.

All CI guards (no-todo-in-prod, skip-inventory-drift, G-3
env-docs-drift, doc-rot-detector, and every existing guard) verified
clean by running each individually.

Closes: cowork/certctl-architecture-diligence-audit.html#fix-TEST-H1,
        cowork/certctl-architecture-diligence-audit.html#fix-TEST-H2,
        cowork/certctl-architecture-diligence-audit.html#fix-TEST-M1,
        cowork/certctl-architecture-diligence-audit.html#fix-TEST-M2,
        cowork/certctl-architecture-diligence-audit.html#fix-TEST-M3,
        cowork/certctl-architecture-diligence-audit.html#fix-TEST-M4,
        cowork/certctl-architecture-diligence-audit.html#fix-TEST-L1,
        cowork/certctl-architecture-diligence-audit.html#fix-ARCH-H3,
        cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L1,
        cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L2,
        cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L3,
        cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L4
2026-05-13 20:10:08 +00:00
shankar0123 69a2b5c55a config: default hardening + operator docs (Phase 2 closure — SEC-H1, SEC-H3, SEC-M4, DEPL-H1, DEPL-M2 + doc-only carve-outs)
Eleven findings from the architecture diligence audit's Phase 2 bundle
closed in one PR. All touch the same backend config + Helm chart +
operator docs surface, so reviewing in one diff is the natural fit.

config.go: three new fail-closed Validate() branches behind sentinels
=====================================================================

Three new error sentinels exported from internal/config/config.go for
tests to pin via errors.Is + message-text:
  - ErrAgentBootstrapTokenRequired (SEC-H1)
  - ErrACMEInsecureWithoutAck      (SEC-M4)
  - ErrDemoModeAckExpired          (SEC-H3)

SEC-H1 (staged): introduces CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY
as an opt-in feature flag. When true AND the bootstrap token is empty,
Validate() returns ErrAgentBootstrapTokenRequired and the server
refuses to start. Default in THIS release: false (warn-mode
pass-through preserved). WORKSPACE-ROADMAP.md schedules the default
flip to true for v2.2.0 — operators get one upgrade window.

SEC-M4: upgrades the existing boot-time WARN log for
CERTCTL_ACME_INSECURE=true into a hard refuse-to-start gate behind
CERTCTL_ACME_INSECURE_ACK=true. The ACK env var must be paired with
the existing INSECURE flag; either alone fails closed. The boot-time
WARN log at cmd/server/main.go:611 continues to fire for the ACK'd
case so every restart logs the reminder.

SEC-H3: tightens the sticky DemoModeAck bit so it expires after 24h.
When DemoModeAck=true, Validate() now requires CERTCTL_DEMO_MODE_ACK_TS
to be set as a unix-epoch timestamp within the last 24h (24h-tolerance
on the past side, 1-minute clock-skew on the future side). Catches the
"forgotten demo deployment promoted to production" failure mode —
next container restart past 24h refuses unless re-ack'd.

Tests in internal/config/config_test.go cover every new branch:
positive (passes when properly set), negative (each fail-closed path
fires with the matching sentinel + message-text). 11 new tests added.

Helm chart + HA runbook (DEPL-H1)
=================================

Created docs/operator/runbooks/ha.md documenting the three values
flips required for production HA: server.replicas, podDisruptionBudget,
service.sessionAffinity. Cross-link comments added to
deploy/helm/certctl/values.yaml next to the server.replicas (line 19)
and podDisruptionBudget (line 566) defaults. DEFAULTS DO NOT CHANGE
— that's the point per the prompt's 'do not flip networkPolicy default'
guidance: a default-enabled PDB blocks fresh helm install on
single-node clusters.

CI guard (DEPL-M2)
==================

scripts/ci-guards/no-change-me-in-prod-compose.sh grep-fails any
'change-me-' literal in compose files OTHER than docker-compose.demo.yml.
Catches the placeholder-credential-leak regression one layer earlier
than the runtime Validate() fail-closed guards from Bundle 2 (2026-05-12).
Excludes comment lines so docs explaining the pattern don't trip the
guard. Verified to fire on a synthetic leak; clean on the current tree.

Consolidated 'Security carve-outs' doc section
==============================================

docs/operator/security.md grows by one new section documenting the
seven existing carve-outs in one canonical place:
  - SEC-M3: 3 InsecureSkipVerify=true sites (Agent dev, verify probe, tlsprobe)
  - SEC-M5: F5 connector InsecureSkipVerify per-config field
  - SEC-M4: ACME insecure + new ACK gate
  - SEC-L1: CSP 'unsafe-inline' on style-src (Tailwind carve-out)
  - SEC-L2: break-glass Argon2id rest-defense reminder
  - SEC-L3: 1 MB body-size cap + CERTCTL_MAX_BODY_SIZE override
  - DEPL-M2: change-me-* placeholder credentials in demo overlay
  - DEPL-M3: K8s NetworkPolicy operator-opt-in default

Each entry cites the file:line, the rationale for the carve-out, and
the operator action.

CHANGELOG + ENVIRONMENTS coverage
==================================

CHANGELOG.md grows by one new '### Breaking changes (scheduled for
v2.2.0)' section under Unreleased, documenting SEC-H1 / SEC-M4 / SEC-H3
with explicit upgrade-window guidance for each.

deploy/ENVIRONMENTS.md adds five rows: AGENT_BOOTSTRAP_TOKEN +
AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY + DEMO_MODE_ACK + DEMO_MODE_ACK_TS +
ACME_INSECURE_ACK. G-3 env-docs-drift CI guard stays clean.

WORKSPACE-ROADMAP.md (cowork-side) schedules the SEC-H1 default-flip
for v2.2.0.

Sandbox limitation
==================

The certctl repo's working tree is 6.1 GB which fills the sandbox
volume; the go1.25.10 toolchain download (go.mod requires it,
sandbox has 1.25.9) keeps failing on disk-full. Local 'go build' /
'go test' were NOT run in this commit's verification path.
make verify MUST be run on the operator's workstation before push
per CLAUDE.md operating rules.

CI guards (no-change-me, G-3 env-docs-drift, doc-rot-detector, +
all existing) verified clean by running each individually.

Closes: cowork/certctl-architecture-diligence-audit.html#fix-SEC-H1,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-H3,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-M4,
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H1,
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M2,
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M3,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-M3,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-M5,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-L1,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-L2,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-L3
2026-05-13 19:50:00 +00:00
shankar0123 0161bb201c docs: remove internal engineering docs; docs must be tool- or story-relevant
Operator policy: docs in the public repo must help (a) a user
deploying certctl or (b) the product story. Internal engineering
process documentation belongs in cowork/ scratchpads or in git
commit history, not docs/.

Removed (docs/contributor/, 8 files, 2,323 lines):
  - release-sign-off.md         — internal release-day checklist
  - ci-pipeline.md              — what runs in CI (internal)
  - ci-guards.md                — what the guards are (internal)
  - testing-strategy.md         — internal testing strategy
  - qa-test-suite.md            — internal QA reference (445 lines)
  - qa-prerequisites.md         — internal QA setup
  - gui-qa-checklist.md         — manual GUI QA checklist
  - test-environment.md         — 1,103-line redundant with
                                  docs/getting-started/quickstart.md +
                                  docs/getting-started/advanced-demo.md

Removed supporting script:
  - scripts/qa-doc-seed-count.sh — CI guard for the deleted
                                   qa-test-suite.md seed-data table

Cross-reference cleanup:
  - README.md: dropped the Contributor audience row + footer
    pointer to docs/contributor/.
  - Makefile: dropped `verify-docs` target + qa-stats comment refs.
  - .github/workflows/ci.yml: dropped the QA-doc seed-count drift
    CI step + dead comment refs.
  - docs/reference/cli.md: repointed qa-prerequisites.md → quickstart.md.
  - docs/operator/performance-baselines.md: dropped ci-pipeline.md
    cross-ref.
  - scripts/ci-guards/README.md: dropped the 'Guards explicitly
    NOT here' section that referenced the deleted QA-doc guards.

G-3 env-docs-drift guard improvements (a real consequence: deleting
the contributor docs surfaced that some env vars only had a home
there). Refit the guard to the new doc topology:
  - Defined-scan widened from `config.go + cmd/*` to all of `cmd/ +
    internal/` (production code), excluding `*_test.go` — catches
    service-layer env vars like CERTCTL_STEPCA_ROOT_CERT and
    CERTCTL_ZEROSSL_EAB_URL that were previously invisible to the
    guard.
  - Docs-scan widened to include deploy/ENVIRONMENTS.md (the
    canonical env-var inventory table — should have been in scope
    from day one). Kept narrow to README + docs/ + deploy/helm/ +
    ENVIRONMENTS.md to avoid pulling in compose/test fixtures.
  - ALLOWED filter now applies to both DOCS_ONLY and CONFIG_ONLY
    directions, so dynamic per-profile dispatch surfaces
    (CERTCTL_SCEP_PROFILE_<NAME>_*, CERTCTL_EST_PROFILE_<NAME>_*,
    CERTCTL_QA_*) don't need static doc entries.
  - Added CERTCTL_SCEP_PROFILE_[A-Z_]+ and CERTCTL_EST_PROFILE_[A-Z_]+
    to ALLOWED for the same reason.

deploy/ENVIRONMENTS.md: added CERTCTL_ZEROSSL_EAB_URL row — real
operator override (overrides the ZeroSSL EAB-credentials endpoint;
read at internal/connector/issuer/acme/acme.go:372) that was
defined in Go source but never documented. G-3 caught it after the
defined-scan widened.

scripts/ci-guards/S-1-hardcoded-source-counts.sh: removed dead
WORKSPACE-CHANGELOG.md allowlist entry (the file was deleted in
the prior workspace cleanup).

Verified:
  All 35 scripts/ci-guards/*.sh green (FAIL=0).
  No remaining references to docs/contributor/ or qa-doc-seed-count
  in tracked files.
2026-05-13 02:44:27 +00:00
shankar0123 57b539c378 docs(b12): observability reference + Postgres backup runbook
Closes acquisition-diligence Bundle 12 — Observability, DR,
Operations Receipts, And Performance Proof. Source IDs: D5, D6, D8,
T9, finding 7, OPS-H1, OPS-M1, OPS-M2, LOW-7.

Two new operator-facing references; both non-audit-framed per the
Bundle 5 doc-placement policy.

docs/operator/observability.md — single canonical statement of what
certctl emits, what it doesn't, and what survives a restart:
  - Metrics surface: both /api/v1/metrics (JSON) and
    /api/v1/metrics/prometheus (text exposition v0.0.4); inventory of
    certctl_certificate_* gauges + certctl_issuance_duration_seconds
    per-issuer-type histogram + certctl_uptime_seconds.
  - Prometheus library vs hand-rolled exposition: explicit scope
    statement — hand-rolled fmt.Fprintf is intentional for v2.x given
    the shallow metric surface; client_golang migration tracked as
    v3 item (closes OPS-M1).
  - Tracing: explicit deferral — no OTel SDK setup, OTel packages
    are indirect-only in go.mod, no spans, no OTLP exporter; tracked
    as v3 item; in the meantime structured logs carry request_id and
    certctl_issuance_duration_seconds carries the per-issuer latency
    signal (closes OPS-M2).
  - Logging: structured JSON via log/slog; CERTCTL_LOG_LEVEL control;
    no key material / bearer tokens / session cookies in log lines.
  - Rate-limit semantics under restarts + replicas: per-process,
    in-memory, reset-on-restart, NOT shared across replicas; full
    inventory of the 5 limiter call sites (break-glass login,
    SCEP/Intune per-device, EST per-principal CSR, EST HTTP-Basic
    source-IP, ACME per-account); multi-replica + sticky-session
    implications; database-backed sliding window deferred to v3
    (closes D8).
  - Performance harness scope: cross-references the explicit
    'What it explicitly does NOT measure' list in
    deploy/test/loadtest/README.md (closes LOW-7 + finding 7).

docs/operator/runbooks/postgres-backup.md — operator-runnable
backup procedure:
  - Inventory of what to back up (DB + operator-managed file
    material that lives outside the DB: CA keys, RA keys, OCSP
    responder keys, trust bundles).
  - Logical backup recipe with docker-compose + Kubernetes variants,
    integrity verification step, off-host storage step.
  - Physical / PITR recipe pointing at pgbackrest / wal-g
    (certctl ships nothing here — standard PostgreSQL DBA work).
  - Three sample automation paths (in-cluster Postgres → S3 CronJob,
    managed Postgres PITR, self-hosted VM systemd timer + restic).
  - Quarterly restore-dry-run procedure.
  - Helm CronJob template deliberately not shipped — three
    documented reasons (deployment topology / secret-management
    integration / off-host storage all vary by operator) plus
    roadmap entry for shipping a starter template when a real
    operator asks for one (closes D6 + OPS-H1).

Both new docs wired into docs/README.md Operator + Runbooks tables.

D5 (ServiceMonitor) and T9 (canonical k6 load-test) were already
shipped in Bundle 3 (deploy/helm/certctl/templates/servicemonitor.yaml)
and in deploy/test/loadtest/ + .github/workflows/loadtest.yml
respectively; this bundle doesn't touch them — it just records the
closure in the audit HTML.

Verified:
  bash scripts/ci-guards/G-3-env-docs-drift.sh    # PASS
  bash scripts/ci-guards/doc-rot-detector.sh      # PASS
  All 35 scripts/ci-guards/*.sh green.
2026-05-13 02:09:11 +00:00
shankar0123 476022ca59 docs(b6): secret-custody reference + config-encryption upgrade runbook + private-key CI guard
Closes acquisition-diligence Bundle 6 findings on secret custody, config
encryption, and local artifact hygiene. Source IDs: S6, R4, SEC-M2,
RT-M1, RT-M2, RT-L1.

Surgical closures (artifact-only audit-framed memos stay out of the
public repo per the Bundle 5 lesson):

R4 / RT-L1 — local EC private key artifact
  rm cmd/agent/mc-001.key (gitignored, never in git history, leftover
  from a 2025-era agent dev run on the operator's workstation).
  Added scripts/ci-guards/B6-no-private-keys-in-tree.sh that fails the
  build if any TRACKED non-test file contains a PEM private-key block,
  so the next attempt to commit similar material gets caught at CI.
  Allowlist: *_test.go (hermetic-test PEMs), examples/*.md (sample
  walkthroughs), internal/scep/intune/testdata/ (certificates, not
  keys).

RT-M1 — landing-page HSM implication
  certctl.io/index.html: 'their hardware' / 'your hardware' colloquial
  comparisons rephrased to 'their custody' / 'your servers'. The phrase
  'Your keys. Your hardware. Your data. Your terms.' becomes 'Your
  keys. Your servers. Your data. Your terms.' to remove any inferred
  HSM-backed key-storage claim. The technical disclosure now lives in
  docs/operator/secret-custody.md (linked below); the landing page no
  longer makes a claim it cannot back.

S6 + SEC-M2 + RT-M2 (composite documentation closure)
  Added docs/operator/secret-custody.md — public operator reference
  enumerating every secret material on the control plane and on
  agents:
    - Local CA private key (FileDriver, file-on-disk, heap-resident
      with the L-014 carve-out documented in
      internal/connector/issuer/local/local.go).
    - Agent ECDSA P-256 keys (file on agent host, never transmitted).
    - OIDC client secret (AES-256-GCM v3, PBKDF2 600k).
    - Session signing key (same encryption regime).
    - Break-glass credential (Argon2id, never encrypted).
    - API-key bearer tokens (SHA-256 hash only; plaintext shown once).
    - CSR private keys mid-issuance (agent memory only).
    - Issuer-connector backend secrets (encrypted_config column,
      fail-closed for source='database', plaintext-by-design for
      source='env' with rationale).
  The Env-seeded-vs-DB-seeded plaintext policy is explained in plain
  text so a buyer review can independently verify the startup guard at
  cmd/server/main.go:222-262 makes sense.

  Added docs/operator/runbooks/config-encryption-upgrade.md — the
  procedural arm: how to force v1/v2 -> v3 re-seal across the
  database, plus the passphrase-rotation order. Documents the
  AEAD-driven read fallback (v3 -> v2 -> v1) and the fact that
  re-sealing happens passively on UPDATE. Open roadmap item: a
  certctl admin reseal --all command (tracked in
  WORKSPACE-ROADMAP.md).

  Both docs wired into docs/README.md Operator + Runbooks tables.

Verification:
  rg -n 'CONFIG_ENCRYPTION|encrypt|v1|private key|HSM|PKCS11|mc-001.key|\.key|Local CA' \
     internal cmd docs .gitignore README.md   # ambient (no NEW leaks)
  find . -name '*.key' \
     -not -path './.git/*' -not -path './web/node_modules/*'   # empty
  git ls-files | xargs grep -lE 'BEGIN .* PRIVATE KEY' \
     | grep -vE '_test\.go$|^examples/|^internal/scep/intune/testdata/'   # empty
  bash scripts/ci-guards/B6-no-private-keys-in-tree.sh   # PASS
  bash scripts/ci-guards/G-3-env-docs-drift.sh           # PASS
  bash scripts/ci-guards/doc-rot-detector.sh             # PASS

Residual roadmap (deliberately deferred):
  - signer.PKCS11Driver (HSM-token-backed CA-key custody).
  - signer.CloudKMSDriver (AWS/GCP/Azure KMS-backed CA-key custody).
  - FIPS 140-3 mode for the whole control plane.
  - HSM-backed session signing key.
  - Built-in 'certctl admin reseal --all' command.
  All five tracked in WORKSPACE-ROADMAP.md, not retracted.
2026-05-13 01:48:40 +00:00
shankar0123 5b151e74da docs: remove audit-bundle-flavored docs from public repo
Three docs added in Bundle 4 + Bundle 5 closure commits (750478a, 596e675)
were framed around acquisition-diligence audit findings and don't belong
in the public-facing operator docs tree:

- docs/operator/scheduler-ha.md         (Bundle 4 D2 per-loop HA truth table)
- docs/operator/rate-limit-scope.md     (Bundle 4 D3 scope statement)
- docs/operator/security-bundle-5-audit-closure.md  (Bundle 5 closure receipt)

Audit-bundle artifacts live in the operator's local cowork/ scratchpad,
not in docs/. The underlying code closures (advisory-lock migrations,
SSRF-guarded notifier transports, break-glass login limiter, MCP gating,
etc.) stand — only the audit-framed documentation surface is removed.

docs/README.md: drop the two table rows that pointed at the now-deleted
scheduler-ha.md + rate-limit-scope.md (added in 750478a, lines 77-78).
2026-05-13 01:35:24 +00:00
shankar0123 264015059d ci(guards): fix G-3 (CERTCTL_MCP_READ_ONLY phantom) + S-1 (hardcoded 45)
Two CI guards tripped on the B4 + B5 closure commits:

1. G-3 env-docs-drift caught `CERTCTL_MCP_READ_ONLY` mentioned in
   docs/operator/security-bundle-5-audit-closure.md (Bundle 5 S8
   row) without a corresponding entry in internal/config/config.go.
   The env var is a v3 idea, not a shipped feature — the doc now
   describes the future gate without naming the literal env var,
   matching the G-3 phantom-env-var contract.

2. S-1 hardcoded-source-counts caught "all 45 migrations" in
   docs/operator/scheduler-ha.md (Bundle 4 D8 closure prose). Per
   the CLAUDE.md operating rule "Numeric claims about current state
   rot", swapped the literal count for the rebuild command
   `ls migrations/*.up.sql | wc -l`.

Both fixes are doc-only — no code change, no test change. The
underlying Bundle 4 + Bundle 5 closures stand.

Verification:
  bash scripts/ci-guards/G-3-env-docs-drift.sh            # clean
  bash scripts/ci-guards/S-1-hardcoded-source-counts.sh   # clean
2026-05-13 01:24:06 +00:00
shankar0123 596e675ec7 fix(security): close BUNDLE 5 — auth, OIDC, MCP, API + browser security edges
Bundle 5 closure (2026-05-13 acquisition diligence audit). 13-finding
security audit pass across the auth / OIDC / MCP / API / browser-
security surface. Five real closures shipped in code, two false-as-
stated findings annotated with the existing implementation, three
operator-decision items documented for v3 follow-up, three doc-only
fixes (auth architecture narrative aligned with shipped OIDC).

Source findings closed (code):
  S1     break-glass /auth/breakglass/login lacked the documented
         5/min per-source-IP rate limit; handler now owns its own
         SlidingWindowLimiter wired at startup. Doc claim turns true.
  R6     OIDC test_discovery JWKS probe ran on http.DefaultClient;
         now uses an http.Client whose transport wraps
         validation.SafeHTTPDialContext. JWKS URI can no longer
         pivot into reserved-address ranges via DNS rebinding.
  R7     Slack + Teams notifiers built http.Client without the SSRF
         dial-time guard. Both New() constructors now install
         validation.SafeHTTPDialContext; webhook URLs (operator-
         configured via dynamic-config GUI) cannot dial 169.254.x or
         in-cluster reserved ranges. Test seam: newForTest bypasses
         the guard for httptest's 127.0.0.1 binds, mirroring the
         existing internal/connector/notifier/webhook pattern.
  RT-L2  CERTCTL_ACME_INSECURE=true now emits a prominent
         logger.Warn at server boot. Pre-Bundle-5 the knob silently
         disabled ACME directory TLS verification.

Source findings closed (doc):
  finding 1 + HIGH-5  Architecture doc claimed no in-process JWT/
         OIDC/mTLS/SAML and pointed everyone at the
         authenticating-gateway pattern. Auth Bundle 2
         (commit dea5053) shipped native OIDC + sessions +
         break-glass. New §"In-process authentication surface"
         table (api-key / oidc / none) supersedes the old framing;
         "Authenticating-gateway pattern (SAML, mTLS-as-auth,
         LDAP)" section retained for protocols certctl still
         doesn't ship natively.

Source findings verified false (existing implementation):
  S4     OIDC email-domain allowlist — `email_domain_test.go`
         already pins the strict-equality semantics (subdomain not
         auto-accepted, multi-entry no-match path, empty allowlist
         accepts all by-design per RFC 9700 §4.1.1).
  SEC-L1 CSP / HSTS / referrer-policy headers — already shipped at
         internal/api/middleware/securityheaders.go and wired at
         cmd/server/main.go L2003+L2027+L2115.

Operator-decision / deferred (tracked in bundle-5 closure doc):
  S3     CERTCTL_API_KEYS_NAMED parsing is wired, end-to-end
         validation is partial. Operator decides: complete the
         named-key middleware path or deprecate the syntax.
  S5     Audit-middleware best-effort for read paths;
         security-critical writes use WithinTx. Operator decides
         per-path escalation.
  S8     MCP threat model — the binary is a thin protocol bridge,
         no privileges of its own; every tool call carries
         CERTCTL_API_KEY and is auth'd + RBAC-gated server-side.
         Optional CERTCTL_MCP_READ_ONLY gate tracked as v3.
  SEC-H1 2026-05-10 audit CRIT-1/2/4 already closed on master;
         CRIT-3/5 status against the spec folder is operator-
         workstation-validation-only. Documented for follow-up.
  SEC-L2 WebAuthn / FIDO2 / step-up — already documented in
         docs/operator/auth-threat-model.md "Threats Bundle 2 does
         NOT close". v3 work item per CLAUDE.md decision 12.

Full per-finding rationale + receipts at
docs/operator/security-bundle-5-audit-closure.md.

Verification:
  gofmt -l                                                # clean
  go vet ./internal/connector/notifier/slack
    ./internal/connector/notifier/teams ./internal/auth/oidc
    ./internal/api/handler ./cmd/server                  # clean
  go build ./cmd/server [...]                            # clean
  go test -short -count=1 ./internal/connector/notifier/slack
    ./internal/connector/notifier/teams ./internal/api/handler
    ./internal/auth/oidc ./internal/config                # PASS
                                                          # (slack 0.028s + teams
                                                          # 0.023s + handler 11.0s;
                                                          # newForTest seam keeps
                                                          # httptest tests green)

Audit-Closes: BUNDLE-5 S1 R6 R7 RT-L2 finding-1 HIGH-5
Audit-Verifies-False: S4 SEC-L1
Audit-Defers: S3 S5 S8 SEC-H1 SEC-L2
2026-05-13 01:18:45 +00:00
shankar0123 750478a6fe fix(scale): close BUNDLE 4 — migrations, scheduler HA, rate-limits, scale receipts
Bundle 4 closure (2026-05-13 acquisition diligence audit). Closes the
"what happens under multi-replica" question cluster: migration runner
had no concurrency control + no applied-version ledger, 15 scheduler
loops had per-process idempotency but no cross-replica documentation,
rate limits were process-local without an operator-facing scope
statement, load-test scope explicitly omitted four hot paths without
linking them to a roadmap.

Source findings closed:
  HIGH-1 + D4 + finding 4                 (migration tracking)
  D8                                       (scheduler loop ownership)
  MED-1 + MED-2                            (rate-limit scope)
  T9 + LOW-7 + finding 7                   (load-test receipt scope)

Closures by source ID:

HIGH-1 + D4 + finding 4 — Migration tracking + advisory lock.
internal/repository/postgres/db.go::RunMigrations now wraps every
migration execution in:
  1. A dedicated *sql.Conn pinned to one connection for the entire
     scan + apply lifecycle (pg_advisory_lock is connection-scoped).
  2. pg_advisory_lock(migrationAdvisoryLockID) — fixed int64 key
     derived from "certctl-migrations" so the same constant resolves
     across deployments without colliding with operator advisory
     locks. Blocks the second replica until the first finishes.
  3. CREATE TABLE IF NOT EXISTS schema_migrations(version TEXT PK,
     applied_at TIMESTAMPTZ DEFAULT NOW()) — audit ledger.
  4. Skip-applied loop: SELECT version FROM schema_migrations →
     map[string]struct{} → skip every .up.sql whose filename is in
     the map. INSERT after successful execute, ON CONFLICT
     (version) DO NOTHING for defense in depth.

Pre-Bundle-4 every server boot re-ran all 45 .up.sql files. The
"idempotency via IF NOT EXISTS / ON CONFLICT" contract in CLAUDE.md
held per-migration but offered no protection when two Helm replicas
raced on schema DDL. Post-Bundle-4 single-replica deploys see zero
behavior change beyond the audit-table population; multi-replica
deploys get HA-safe schema bootstrap.

D8 — Scheduler HA semantics documented.
New docs/operator/scheduler-ha.md with per-loop inventory of all 15
loops in internal/scheduler/scheduler.go. Classification:
  - HA-safe (jobProcessorLoop, jobRetryLoop) — FOR UPDATE SKIP
    LOCKED via ClaimPendingJobs (Bundle 1 H-6 closure, 3e78ecb).
  - HA-safe-ish (jobTimeoutLoop) — atomic UPDATE-WHERE-status.
  - Idempotent under N>1 replicas (renewalCheckLoop,
    agentHealthCheckLoop, shortLivedExpiryCheckLoop, networkScanLoop,
    healthCheckLoop, acmeGCLoop, sessionGCLoop) — duplicate ticks
    produce idempotent side effects.
  - Side-effect-duplicating under N>1 replicas
    (notificationProcessLoop, notificationRetryLoop, digestLoop,
    cloudDiscoveryLoop, crlGenerationLoop) — duplicate
    webhook/email/AWS-API/CRL-signing operations. Operators
    running multi-replica accept N× side effects or pin to
    server.replicas: 1.

Leader-election work tracked in WORKSPACE-ROADMAP.md as v3.

MED-1 + MED-2 — Rate-limit scope.
New docs/operator/rate-limit-scope.md states the contract verbatim:
process-local sync.Mutex-guarded sliding-window log, effective
cluster-wide cap = configured-per-replica × server.replicas,
restart-safe (no persistent state, no shared store), bounded
(50k/100k key cap with eviction). Five call sites documented:
ocspLimiter (1m/IP), exportLimiter (1h/actor), EST per-principal
(24h/CN), EST failed-auth (1h/IP), Intune dispatcher
(24h/Subject+Issuer), plus the HTTP middleware token-bucket
(RPS+Burst per replica). Cluster-wide shared limits via Redis or
Postgres-backed bucket are tracked in WORKSPACE-ROADMAP.md as v3.

T9 + LOW-7 + finding 7 — Load-test receipt scope.
The existing harness at deploy/test/loadtest/ already
self-documents the gap ("What it explicitly does NOT measure"). No
code change needed for this finding; Bundle 4 cross-references
scheduler-ha.md and rate-limit-scope.md from those gap callouts so
the four deferred coverage classes (issuer connector, scheduler
throughput, agent fleet, DB p99) land in the same place an
acquirer reads about HA semantics and rate limits.

Tests:
  internal/repository/postgres/migrations_test.go (new, 4 tests):
    - TestRunMigrations_PopulatesSchemaMigrations: audit table
      exists and is non-empty after the first migration run.
    - TestRunMigrations_SkipsAppliedOnSecondCall: second call is
      observable no-op on row count.
    - TestRunMigrations_ConcurrentCallsSerialized: two goroutines
      racing the migrator both return without error; row count
      unchanged; no duplicate versions.
    - TestRunMigrations_FreshDatabaseHappyPath: ≥ 30 migrations
      land on a fresh schema.
  Gated by testcontainers via the existing repo_test.go getTestDB
  pattern; skipped under -short. The integration lane runs them.

Verification:
  gofmt -l                                              # clean
  go vet ./internal/repository/postgres ./cmd/server    # clean
  go build ./cmd/server ./internal/repository/postgres  # clean
  go test -short -count=1 ./internal/repository/postgres
    ./internal/ratelimit                                # PASS
  Operator follow-up: full integration run on workstation:
    go test -count=1 ./internal/repository/postgres -run TestRunMigrations_

Receipts (paths for the audit packet):
  Migration runner evidence: internal/repository/postgres/db.go
    L135-340 (advisory-lock + ledger + skip-applied loop) +
    internal/repository/postgres/migrations_test.go (4 tests).
  Scheduler loop inventory: docs/operator/scheduler-ha.md (15-loop
    table with HA classification per loop).
  Rate-limit storage matrix: docs/operator/rate-limit-scope.md.
  Load-test baseline: deploy/test/loadtest/README.md (already
    self-documenting), cross-linked from scheduler-ha.md.

Remaining operator warnings (deferred, tracked in WORKSPACE-ROADMAP.md):
  - Leader election for the four duplicate-side-effect loops
    (notificationProcessLoop, notificationRetryLoop, digestLoop,
    cloudDiscoveryLoop, crlGenerationLoop). v3 work item.
  - Shared rate-limits across replicas (Redis / Postgres token
    bucket). v3 work item.
  - Issuer-connector + scheduler-throughput + agent-fleet + DB-p99
    load-test coverage. Tracked separately; per-issuer Prometheus
    histograms already capture issuer round-trip latency in
    production runs.

Audit-Closes: BUNDLE-4 HIGH-1 D4 D8 MED-1 MED-2 T9 LOW-7 finding-4 finding-7
2026-05-13 01:00:39 +00:00
shankar0123 a849c8b8cf fix(security): close BUNDLE 2 — safe first run, demo mode, agent bootstrap
Bundle 2 closure (2026-05-12 acquisition diligence audit). Closes the
"docker compose up == accidental production" hazard: pre-Bundle-2 the
base deploy/docker-compose.yml WAS the demo path (AUTH_TYPE=none +
DEMO_MODE_ACK=true + KEYGEN_MODE=server + DEMO_SEED=true + literal
change-me-... placeholder creds), the README claimed "drop the demo
overlay for a clean install", and ENVIRONMENTS.md table documented
auth-type default as api-key — three contradictory stories layered on
the same compose file.

Source findings closed:
  R2 R3 C1 D9 finding-2 S9               (repo audit)
  SEC-H2 SEC-M1 SEC-M3 OPS-M3 LOW-5 HIGH-6 (cowork audit)

Compose split (deploy/docker-compose.yml + deploy/docker-compose.demo.yml):
The base now ships production-shaped — no AUTH_TYPE override, no
KEYGEN_MODE override, no DEMO_MODE_ACK, no DEMO_SEED, no literal
placeholder fallbacks. POSTGRES_PASSWORD / CERTCTL_AUTH_SECRET /
CERTCTL_CONFIG_ENCRYPTION_KEY / CERTCTL_API_KEY / CERTCTL_AGENT_ID
must come from deploy/.env (sample template in deploy/.env.example +
root .env.example). The demo overlay carries the full demo posture
(every env var + every placeholder credential) so the
`-f docker-compose.demo.yml` one-flag flip remains a zero-config
populated-dashboard path.

Fail-closed startup guards (internal/config/config.go::Validate):
Three new gates layered on the existing HIGH-12 demo-mode listen-bind
guard. All three exempt CERTCTL_DEMO_MODE_ACK=true so the demo overlay
keeps working:
  • HIGH-6:  AUTH_SECRET = "change-me-in-production"        → refuse
  • HIGH-6:  CONFIG_ENCRYPTION_KEY = "change-me-32-char..." → refuse
  • LOW-5:   CORS_ORIGINS contains "*"  (CWE-942 + CWE-352) → refuse

Visible DEMO MODE banner (cmd/server/main.go): every boot under
DEMO_MODE_ACK=true now emits a prominent WARN line with a 6-step
production-promotion checklist. The 2026-04-19 incident (a screenshot
run that kept running for three days) drove this; the per-startup
banner makes the posture unmissable in any log scraper.

Agent enrollment doc alignment:
  • docs/reference/configuration.md L83: corrected the non-existent
    URL `POST /api/v1/agents/register` to the real route
    `POST /api/v1/agents`; added the bootstrap-token note and the
    install-agent.sh handoff sequence.
  • docs/reference/architecture.md L154: replaced "agents register
    themselves at first heartbeat" (false — cmd/agent/main.go fail-
    fasts when CERTCTL_AGENT_ID is unset) with the actual two-step
    operator-driven flow (REST or GUI registration first, returned ID
    fed to install-agent.sh second).

Tests + CI guard:
  • 9 new TestValidate_Bundle2_* cases in internal/config/config_test.go
    covering: placeholder-secret refused + demo-ack exempt; placeholder
    encryption-key refused + demo-ack exempt; real key not mistaken for
    placeholder; wildcard CORS refused + demo-ack exempt; wildcard mixed
    into a concrete allowlist still refused; concrete allowlist accepted.
  • scripts/ci-guards/B2-compose-base-no-demo-env.sh: greps the base
    compose for any of the demo-mode env vars + placeholder credentials.
    Comments stripped before checking so the narrative header in the
    base file can still reference the overlay's posture in prose.

Cold-DB CI smoke (.github/workflows/ci.yml::cold-db-compose-smoke):
Switched to layering -f docker-compose.demo.yml on top of the base —
the new production base requires real env vars the smoke doesn't have,
and the smoke's purpose (catch migration-on-cold-DB regressions + the
bootstrap-token mint path) is orthogonal to which auth posture the
boot lands in.

Receipts:
  • Current first-run truth table
        compose flag                                  → posture
        -f docker-compose.yml                          (production)
                                                       → requires .env;
                                                       fail-fasts on
                                                       missing AUTH_SECRET
                                                       / CONFIG_ENCRYPTION
                                                       _KEY / POSTGRES
                                                       _PASSWORD; agent
                                                       fail-fasts on
                                                       missing AGENT_ID
        -f docker-compose.yml -f docker-compose.demo.yml  (demo)
                                                       → zero-config;
                                                       AUTH_TYPE=none +
                                                       DEMO_MODE_ACK=true
                                                       + KEYGEN=server +
                                                       DEMO_SEED=true;
                                                       boot banner WARN
        -f docker-compose.yml -f docker-compose.dev.yml   (dev)
                                                       → base + PgAdmin
                                                       + debug logging
        -f docker-compose.test.yml                     (test, standalone)
                                                       → production-shape
                                                       posture, real CA
                                                       backends
  • Verification (PATH=/tmp/go/bin export GO* paths to /tmp):
        gofmt -l                                      # clean (no diffs)
        go vet ./internal/config ./cmd/server         # clean
        go test -short -count=1 ./internal/config/... # PASS (cumulative +
                                                       all 9 new Bundle 2
                                                       cases green)
        go test -short -count=1                       # PASS (no regression
            ./internal/connector/target/configcheck    in the Bundle 1 -
                                                       closure tests)
        go build ./cmd/server ./cmd/agent             # clean
            ./cmd/cli ./cmd/mcp-server
        bash scripts/ci-guards/B2-compose-base-no-demo-env.sh  # clean
        bash scripts/ci-guards/H-1-encryption-key-min-length.sh # clean
        bash scripts/ci-guards/G-3-env-docs-drift.sh           # clean

Remaining operator warnings (not blocking; tracked in CLAUDE.md
"Open decisions"):
  • The first `docker compose -f docker-compose.yml up -d` against a
    pre-Bundle-2 .env (placeholder values still in place) will now
    fail-fast. This is the intended posture but operators upgrading
    from v2.0.x via .env-from-old-master need to rotate before
    upgrading. The CHANGELOG note for the v2.1.0 release should
    call this out alongside Auth Bundle 2's other breaking changes.

Audit-Closes: BUNDLE-2 R2 R3 C1 D9 S9 SEC-H2 SEC-M1 SEC-M3 OPS-M3 LOW-5 HIGH-6
2026-05-13 00:14:59 +00:00
shankar0123 d60a0ac297 fix(security): close BUNDLE 1 — server+agent connector config validation chain
Bundle 1 closure (2026-05-12 acquisition diligence audit). Closes the
acquisition-blocker chain: target.edit (default r-operator grant per
migrations/000029_rbac.up.sql:196) → arbitrary reload_command stored
without validation → agent createTargetConnector json.Unmarshal-only
→ sh -c on agent host. README's 'shell injection prevention on all
connector scripts' claim is now true at the chain level.

Server-side: new internal/connector/target/configcheck package + a
configcheck.Validate call in target.go::Create + ::Update +
::CreateTarget + ::UpdateTarget (all 4 entry points). Rejects shell
metacharacters in reload_command / validate_command / restart_command
for nginx, apache, haproxy, postfix/dovecot, javakeystore, ssh. Sentinel
errors.Is(err, service.ErrInvalidConnectorConfig) available for handler
400 mapping. Non-shell connector types (F5, IIS, Caddy, Traefik, Envoy,
cloud targets, K8s) are no-ops by design.

Agent-side: defense-in-depth connector.ValidateConfig(ctx, configJSON)
call in cmd/agent/main.go inserted between createTargetConnector and
DeployCertificate. This catches (a) configs pre-dating the server gate,
(b) encrypted-blob tampering, (c) per-connector filesystem invariants
that the server can't check.

F5 (S2 finding): proven docs-vs-code drift, not a security bug. The
applyDefaults function never set Insecure=true; runtime default has
always been Go zero-value (false → TLS verified). Three lying 'default
true' comments in f5/f5.go (lines 30, 45-47, 126) rewritten to match
actual code behavior.

Docs (C4 + C9): README L12 + L68 narrowed — 'any CA / any server' →
'Twelve native CA connectors plus an OpenSSL adapter; fifteen native
deployment-target connectors plus a proxy-agent pattern.' 'Every deploy
goes through atomic-write + ...' narrowed to file-based connectors with
inline link to per-target guarantee matrix. New deployment-model.md §1.6
ships a 15-target × 8-property guarantee table covering atomic write /
owner-perms / SHA-256 idempotency / pre-deploy snapshot / on-failure
rollback / post-deploy TLS verify / Prometheus counters / shell-injection
validation — including the K8s preview honesty marker (CLAIM-H4).

Tests: internal/connector/target/configcheck/configcheck_test.go covers
14 shell-injection payloads (semicolon, pipe, backtick, dollar-paren,
redirect, and-chain, newline, double-quote, escape, dollar-var) × 7
shell-using connectors + benign-command acceptance + non-shell no-op
behavior + empty config + malformed JSON. All pass.

Verification (run from /sessions/gifted-blissful-pasteur/mnt/cowork/certctl):
  go fmt ./...              # clean (no diffs)
  go vet ./...              # clean (no findings)
  go test -short -count=1 ./internal/... ./cmd/...
                            # 60+ packages all ok, zero FAIL

Audit-Closes: BUNDLE-1 RT-C1 SEC-M4 CLAIM-M2 CLAIM-L3
Audit-Verifies-False: S2 (F5 'default insecure' was a comment lie, code was always secure)
2026-05-12 23:48:08 +00:00
shankar0123 aedf19d128 ci(cold-db-smoke): inline into workflow; remove the script (operator: not a per-commit gate)
Operator pushback: 'I don't want a smoke test I have to manually run
every time I commit.' Correct read — the script existed for local
debugging but its presence in scripts/ci-guards/ implied 'operator
runs this regularly,' which is the opposite of the design intent.

Changes:

- Removed scripts/ci-guards/cold-db-compose-smoke.sh.
- Inlined the smoke logic directly into the
  cold-db-compose-smoke job in .github/workflows/ci.yml. Same
  semantics: docker compose down -v -> up -d -> wait-healthy ->
  bootstrap admin -> issue/renew/revoke -> assert audit rows ->
  teardown. 15-min wall-clock cap. Logs dump on failure.
- Removed the cold-db-compose-smoke.sh skip case from the generic
  regression-guards loop (no longer needed).
- Updated scripts/ci-guards/README.md and
  docs/contributor/ci-guards.md to reflect the new shape: 'lives in
  the workflow, not as a script.'

Workspace docs updated (cowork/WORKSPACE-CHANGELOG.md,
cowork/CLAUDE.md, cowork/auditable-codebase-bundle/RESULTS.md).

The gate is unchanged: CI runs the smoke on every push, master
branch-protection enforces it as a required check. Operator's
manual action is once — adding the check to branch-protection.

Audit-Closes: post-v2.1.0-anti-rot/item-6
2026-05-12 14:22:19 +00:00
shankar0123 9f7b5d89a5 docs(contributor): document the Auditable Codebase Bundle guards
Three doc changes for the bundle's discoverability:

1. New docs/contributor/ci-guards.md (185 lines)
   Entry-point doc for new contributors. Explains the four categories
   of guards (code-shape, contract-parity, build/dep, operational),
   the discipline that keeps them honest (allowlist + expiration),
   and how to add a new one. Cross-references scripts/ci-guards/README.md
   for the exhaustive list.

2. scripts/ci-guards/README.md — added a 'Forward-looking guards'
   subsection naming complete-path-config-coverage, doc-rot-detector,
   and cold-db-compose-smoke with their item references + a
   one-sentence description of what each catches. Replaced the
   stale '22 guards' header with 'Count: re-derive via ls' per the
   no-version-stamped-numbers convention from CLAUDE.md.

3. docs/README.md — wired ci-guards.md into the Contributor section
   navigation table.

Bumped 'Last reviewed:' to 2026-05-12 on the two docs touched
(docs/README.md, docs/contributor/ci-pipeline.md).

Verified: doc-rot-detector.sh green at 91 docs scanned, 89 dated, 0
warns, 0 fails.

Audit-Closes: post-v2.1.0-anti-rot/item-1
Audit-Closes: post-v2.1.0-anti-rot/item-2
Audit-Closes: post-v2.1.0-anti-rot/item-5
Audit-Closes: post-v2.1.0-anti-rot/item-6
2026-05-12 14:15:13 +00:00
shankar0123 56e2ea1ad7 docs: v2.1.0 release polish — strip internal bundle/phase tags, update status for OIDC ship
README:
- Rewrite Status block: drop the stale 'federated identity not yet
  shipped' line; flag v2.1.0 OIDC + sessions + back-channel logout
  + break-glass as early-access; encourage GitHub issues for IdP
  rough edges. (A1 framing — keep early-access umbrella, no
  SAML/WebAuthn/JIT roadmap teaser.)
- Add OIDC SSO bullet to 'What it does' covering per-IdP runbooks,
  group-claim → role mapping, AES-256-GCM client_secret encryption,
  JWKS auto-refresh, PKCE-S256, RFC 9700 §4.7.1 pre-login binding,
  RFC 9207 iss check, __Host- cookies, CSRF rotation, idle+absolute
  expiry, BCL, break-glass admin.
- Update Security paragraph: three auth paths (API keys / OIDC /
  break-glass), HMAC-signed sessions, CSRF rotation, RFC OIDC BCL.
- Correct CI coverage thresholds against
  .github/coverage-thresholds.yml (service 70%, handler 75%,
  crypto 88%, auth packages 85-95%); 'static analysis' replaces
  the inflated '11 linters' claim (actual count is 4 active).

Docs B3 sweep — strip operator-facing 'Bundle N' / 'Phase N' tags:
- docs/operator/auth-threat-model.md — rewrite intro; rename 5 H2
  sections (API-key + RBAC defenses / OIDC + sessions + break-glass
  defenses / OIDC + sessions threat catalogue / Closed federated-
  identity threats / Future-work threats); clean ~12 H3/prose hits.
- docs/operator/rbac.md — strip Bundle 1 framing from intro,
  scope_id deferral note, MCP tools section, day-0 bootstrap, and
  'Where to look next'.
- docs/operator/auth-benchmarks.md — drop 'Phase 14' framing from
  title intro, hardware floor caption, result table caption,
  methodology, and pre-merge audit section.
- docs/operator/security.md — already cleaned earlier this session
  (RBAC / day-0 / approval-bypass / OIDC federation / sessions /
  OIDC first-admin / break-glass H3s).
- docs/operator/oidc-runbooks/{index,keycloak,authentik,okta,
  azure-ad}.md — strip Auth Bundle 2 framing + Phase 10/3/4
  references; replace with feature-name prose.
- docs/operator/legacy-clients-tls-1.2.md — drop Bundle F / M-023
  audit-reference framing; keep CWE-326.
- docs/operator/database-tls.md — drop Bundle B / M-018 framing
  from intro + Helm section.
- docs/operator/runbooks/disaster-recovery.md — drop 'Production
  hardening II Phase 10' status callout.
- docs/migration/oidc-enable.md — retitle 'Enable OIDC SSO';
  strip Bundle 1/2 framing from prereqs, troubleshooting, related
  docs; update __Host- cookie callout from 'audit MED-14' to
  v2.1.0-BREAKING.
- docs/migration/api-keys-to-rbac.md — strip Bundle 1 framing from
  intro, migration table, IsAdmin section, and cross-references.
- docs/migration/acme-from-cert-manager.md — strip residual
  'Phase 5' tags from cert-manager integration test references.
- docs/reference/configuration.md — retitle Auth section.
- docs/reference/profiles.md — strip Bundle 1 Phase 9 framing
  from RequiresApproval section + Related list.
- docs/reference/auth-standards-implemented.md — rewrite intro
  (API-key + RBAC + OIDC + sessions + back-channel logout +
  break-glass); rename 'Bundle 1 (RBAC) standards covered
  separately' H2; clean per-row Phase references.
- docs/README.md — rewrite nav-table entries to drop Bundle 1/2
  parentheticals; retitle 'Enable OIDC SSO' migration entry.

No code or test changes; pure operator-facing prose polish for
the v2.1.0 tag.
2026-05-11 16:54:07 +00:00
shankar0123 fefeccfa59 harden(oidc): relax alg-downgrade IdP-bind check to intersection-empty (Keycloak compat)
Phase-10 live-IdP smoke (Keycloak 26.x via testcontainers-go) revealed
the IdP-bind alg-downgrade check was too strict for real-world IdPs.
6 of the integration tests in internal/auth/oidc/integration_keycloak*_test.go
were failing with:

  oidc: IdP advertises weak signing algorithms (HS*/none);
  refusing to use as defense against downgrade attacks: HS256

Keycloak 26.x (and several other real-world IdPs — Auth0 when HS-mode is
enabled, some Authentik configs) advertise EVERY alg they're capable of
in the discovery doc's id_token_signing_alg_values_supported field, even
when the realm only signs with RS256 in practice. Pre-fix the IdP-bind
check refused on ANY HS* or 'none' advertisement → no real Keycloak deploy
could ever bind a provider row, hence the integration-test failures.

The strict-deny check was defense-in-depth on top of the load-bearing
per-token alg-pin at sig-verify time (isDisallowedAlg, service.go L1177):
that check rejects every ID token whose JWS header carries an alg outside
DefaultAllowedAlgs, regardless of what the discovery doc advertises.
A forged HS256 token signed with the IdP's RS256 pubkey as HMAC secret
is rejected at sig-verify time → the actual algorithm-confusion attack
is closed by the per-token pin, NOT by the discovery-doc check.

Fix: relax the IdP-bind check to refuse only when the intersection of
advertised vs DefaultAllowedAlgs is EMPTY (the pathological all-weak-alg
IdP case). Keycloak (RS256 + HS256 advertised) now binds successfully;
an HS-only IdP still fails closed.

Changes:
- internal/auth/oidc/service.go: rewrite the alg-check loop at L1067 in
  getOrLoad / RefreshKeys to compute the intersection set; refuse only
  when no acceptable alg is advertised. ErrIdPDowngradeAdvertised
  docstring updated to reflect new contract. DefaultAllowedAlgs
  docstring + the package-level design-comment block at L40-72 updated
  with v2.1.0-relaxed semantics callouts.
- internal/auth/oidc/test_discovery.go: TestDiscovery dry-run validator
  rewritten to surface HS*/none alongside RS* as an informational note
  ('note: IdP advertises weak algorithms %v alongside acceptable ones')
  rather than a hard-fail error. HS-only / none-only still hard-fails.
- internal/auth/oidc/service_test.go: TestService_IdPDowngradeDefense_*
  tests updated. Renamed:
  - RejectsHSAdvertised → RS256PlusHS256_BindsSuccessfully (positive)
  - RejectsNoneAdvertised → RejectsHSOnlyAdvertised (intersection-empty)
  - RefreshKeys_CatchesPostLoadDowngrade rotated to HS-only post-load
- internal/auth/oidc/coverage_fill_test.go: TestTestDiscovery_AlgDowngradeDetected
  split into _HS256AlongsideRS256_BindsWithNote (positive, asserts note
  but no hard-fail) + _HSOnly_StillTrips_HardFail (intersection-empty).
- docs/operator/auth-threat-model.md: OIDC token-validation alg-allow-list
  section rewritten to call out the load-bearing-defense hierarchy
  (per-token pin first, IdP-bind check defense-in-depth) and document
  the v2.1.0 relaxation rationale.
- CHANGELOG.md: ### Security entry under Unreleased.

Verify: go test ./internal/auth/oidc/ -short PASS; gofmt clean; go vet
clean. The Keycloak integration tests should now pass when the operator
re-runs 'make keycloak-integration-test'.
2026-05-11 15:34:59 +00:00
shankar0123 eee124efb6 chore(ci-guards): close 4 CI-guard regressions surfaced by v2.1.0 release-gate Phase 5
Four scripts/ci-guards/*.sh trips on dev/auth-bundle-2 vs master:

1. G-3-env-docs-drift: 10 CERTCTL_* env vars added by Auth Bundle 2 +
   audit-2026-05-10/11 fix bundle were not in docs/. Added a new 'Auth
   (Bundle 1 + Bundle 2)' section to docs/reference/configuration.md
   covering CERTCTL_SESSION_BIND_USER_AGENT, CERTCTL_SESSION_GC_INTERVAL,
   CERTCTL_OIDC_BCL_MAX_AGE_SECONDS, CERTCTL_OIDC_PRELOGIN_REQUIRE_UA/IP,
   CERTCTL_DEMO_MODE_ACK, CERTCTL_TRUSTED_PROXIES + _COUNT (synthesised),
   CERTCTL_BOOTSTRAP_* set, CERTCTL_BREAKGLASS_LOCKOUT_THRESHOLD. Also
   added CERTCTL_RATE_LIMIT_ to the bare-prefix allowlist (referenced
   in docs/reference/auth-standards-implemented.md prose).

2. bundle-8-M-009-bare-usemutation: BreakglassPage shipped 3 bare
   useMutation() calls instead of useTrackedMutation. Migrated all
   three to useTrackedMutation with invalidates: [['breakglass']].

3. multi-tenant-query-coverage: Defense-in-depth tenant_id additions
   in the fix bundle dropped the missing-tenant-id query count from 32
   to 31. Ratcheted baseline 32 -> 31 (forward-only invariant).

4. openapi-handler-parity: 28 new REST endpoints from Bundle 2 + the
   fix bundle missing from api/openapi.yaml. Added them to
   api/openapi-handler-exceptions.yaml with per-route 'why:'
   justifications. OpenAPI schema generation deferred to pre-v2.2.0
   alongside the GUI E2E coverage push; threat model + handler
   contracts already live in docs/operator/{rbac,auth-threat-model,
   oidc-runbooks}.md.

After this commit every script in scripts/ci-guards/*.sh exits 0.
2026-05-11 14:19:35 +00:00
shankar0123 a923cf697c harden(auth): demo-mode residual-grants detector + cleanup endpoint + CI guard (A-8)
Audit 2026-05-11 A-8 closure. Closes the deferred Phase 2 leg of the
2026-05-10 HIGH-12 closure (2e97cc1) — production-startup observability
for actor-demo-anon residual grants + CI guard banning new synthetic-
admin code paths.

What this changes:

* cmd/server/preflight_demo_residual.go (new) runs after the DB pool +
  audit service are constructed and before the HTTPS listener starts.
  Under any non-'none' auth type it queries actor_roles for the
  synthetic actor-demo-anon and emits a WARN log + a categorized audit
  row (auth.demo_residual_grants_detected) listing every grant
  present. Migration 000029 unconditionally seeds the ar-demo-anon-admin
  row at install time, so EVERY production deploy will see this WARN
  on first boot; the intended cutover workflow is cleanup-once at
  production handover.

* CERTCTL_DEMO_MODE_RESIDUAL_STRICT (new env var on AuthConfig,
  default false) pivots the WARN to fail-closed startup refusal for
  operators who want a paranoid posture against re-seeding.

* POST /api/v1/auth/demo-residual/cleanup (new handler at
  internal/api/handler/demo_residual.go) is an admin-class
  (auth.role.assign) endpoint that removes every actor-demo-anon row
  from actor_roles and returns {removed: int64}. Idempotent; refuses
  503 under Auth.Type=none (deleting the row would break the demo
  path); audit-logs every invocation including no-op zero-removed
  calls so the admin's action is always recorded.

* scripts/ci-guards/no-new-synthetic-admin.sh pins the 17-entry
  allowlist of source files that legitimately reference the
  actor-demo-anon literal. New runtime code paths that resolve to the
  synthetic actor (the same pattern that produced the original CRIT
  class) are rejected at PR time. CI workflow auto-picks the script
  via the existing scripts/ci-guards/*.sh loop in .github/workflows/
  ci.yml; no workflow edit needed.

Regression matrix:

* cmd/server/preflight_demo_residual_test.go — 7 tests covering the
  4 main behaviour branches (testcontainers-backed, testing.Short()-
  skipped: DemoModeActive_Skips, NoResidue_Passes, HasResidue_LogsAnd
  Audits, StrictMode_RefusesStartup, DeleteDemoAnonResidue_Idempotent)
  plus 3 pure-Go stdlib unit tests for the row-string formatter +
  nil-safety contracts on both helpers.

* internal/api/handler/demo_residual_test.go — 7 stdlib+httptest
  cases: HappyPath, Idempotent_ReturnsZero, RejectsInDemoMode (503),
  CleanupError_Surfaces500, NilCleanupFn (defensive 500),
  NilAuditWriter_DoesNotPanic, MissingActorContext (falls back to
  'unknown' actor in the audit row).

* internal/api/router/openapi_parity_test.go — new
  POST /api/v1/auth/demo-residual/cleanup entry plus 6 pre-existing
  pre-A-8 entries (oidc/test, jwks-status, users CRUD, runtime-config)
  that had drifted out of SpecParityExceptions; the parity test was
  red on dev/auth-bundle-2 before my work; this commit returns it to
  green with full per-entry justifications + parity-debt notes.

Docs:

* docs/operator/security.md — new 'Demo-to-production cutover (Audit
  2026-05-11 A-8)' section explaining the WARN message, the cleanup
  curl one-liner, the equivalent SQL, the strict-mode env var, and
  the CI guard.

* docs/operator/rbac.md — Last-reviewed bump + pointer to the new
  env var + the security.md section.

* cowork/auth-bundles-audit-2026-05-10.md — HIGH-12 row gains an
  'A-8 follow-on CLOSED 2026-05-11' annotation describing the
  deferred Phase 2 leg now landed.

* CHANGELOG.md — Unreleased ### Security entry summarizing the four
  legs (detector + cleanup + strict-mode flag + CI guard) and the
  acquisition-readiness narrative this closes.

Operator-facing impact: this closes a credibility gap, not an
exploitable vulnerability. The residue requires a regression
elsewhere in the middleware chain to be exploitable. After this
fix, the canonical narrative ('RBAC primitive with no synthetic-
admin fallback') is fully true.

Refs cowork/auth-bundles-fixes-2026-05-11/08-high-demo-mode-residual-
cleanup.md.
2026-05-11 11:45:54 +00:00
shankar0123 0152bdf567 fix(auth/rbac): scope-aware ActorRole revoke (A-4)
HIGH-10's UNIQUE (actor, role, scope_type, scope_id, tenant) uniqueness
extension lets an operator grant the same role to the same actor at
multiple scopes (e.g. r-operator on profile=p-acme AND profile=p-globex).
But ActorRoleRepository.Revoke's WHERE clause omitted (scope_type,
scope_id) — a single call deleted every variant. Selective revoke was
unrepresentable; operators had to drop all and re-grant N-1, opening
a race window where the actor's access was briefly different.

Closure across all layers (handler → service → repo → MCP → GUI client),
preserving the legacy "revoke all variants" contract for unmodified
callers:

  internal/repository/auth.go
    - New ActorRoleRevokeOptions struct. Zero value = legacy semantic;
      non-empty ScopeType narrows to one variant.
    - New ErrActorRoleNotFound sentinel for scoped no-match (HTTP 404).

  internal/repository/postgres/auth.go
    - Revoke signature extended with opts. Empty opts.ScopeType uses
      the legacy SQL (no scope WHERE), zero-row delete = no error.
    - Non-empty narrows with `scope_type = $5 AND scope_id IS NOT
      DISTINCT FROM $6` — the IS-NOT-DISTINCT-FROM is load-bearing,
      vanilla `=` would silently miss the (global, NULL) case because
      NULL ≠ NULL in standard SQL.
    - Selective revoke with zero matching rows returns
      ErrActorRoleNotFound; operators get feedback on typos.

  internal/service/auth/actor_role_service.go
    - Revoke takes opts. Audit row's details map records the scope so
      SIEMs can distinguish wide-vs-selective revokes:
      `scope: "all_variants"` for the legacy path, or
      `scope_type` + `scope_id` for selective. Privilege check
      (auth.role.assign) and reserved-actor guard unchanged.

  internal/api/handler/auth.go
    - RevokeRoleFromKey parses optional `?scope_type=` / `?scope_id=`
      query params via new parseRevokeScope helper.
    - Validation mirrors AssignRoleToKey: scope_id forbidden with
      scope_type=global, required with profile/issuer, invalid
      scope_type → 400. scope_id without scope_type also → 400.
    - writeAuthError maps ErrActorRoleNotFound to 404.

  internal/mcp/tools_auth.go + types.go
    - AuthRevokeKeyRoleInput gains optional ScopeType + ScopeID with
      jsonschema descriptions explaining the dual-mode contract.
    - Tool call site appends URL-encoded query params when ScopeType
      is set; legacy callers (no scope_type) emit the bare DELETE
      path unchanged.

  web/src/api/client.ts
    - authRevokeKeyRole signature: optional 3rd argument
      `{ scope_type?, scope_id? }`. Pre-A-4 call sites (no opts arg)
      keep firing the bare DELETE — fully backward compatible. The
      GUI KeysPage's per-row revoke button (still one row per role,
      pre-Fix-12) continues to use the legacy shape; future GUI work
      can pass scope params for per-variant rows.

  docs/operator/rbac.md
    - New "Revoke: legacy 'all variants' vs scope-selective" subsection
      under "From the HTTP API" with curl examples for both modes plus
      the audit-row payload shape that lets SOC/SIEM tell them apart.

Regression coverage:

  Repository (testcontainers, skipped under -short — 6 tests in
  internal/repository/postgres/auth_revoke_scope_test.go):
    TestRevokeActorRole_NoOpts_RemovesAllVariants
    TestRevokeActorRole_WithScope_RemovesOnlyMatching
    TestRevokeActorRole_WithGlobalScope_RemovesOnlyGlobal — pins the
      IS-NOT-DISTINCT-FROM branch (global, NULL)
    TestRevokeActorRole_NoMatch_ReturnsNotFound — pins the new sentinel
    TestRevokeActorRole_NoOpts_NoMatch_IsNoOp — pins the legacy
      idempotence contract
    TestRevokeActorRole_IssuerScope_RemovesOnlyMatching — pin the
      issuer-scope half (profile + issuer are symmetric scope types)

  Handler (7 new tests in auth_test.go):
    TestAuthHandler_RevokeRoleFromKey — extended to assert no scope
      filter is forwarded when query string is empty (legacy behaviour)
    TestAuthHandler_RevokeRoleFromKey_A4_ScopedProfile
    TestAuthHandler_RevokeRoleFromKey_A4_ScopedGlobal
    TestAuthHandler_RevokeRoleFromKey_A4_RejectsScopeIDWithGlobal
    TestAuthHandler_RevokeRoleFromKey_A4_RejectsMissingScopeID
    TestAuthHandler_RevokeRoleFromKey_A4_RejectsScopeIDWithoutScopeType
    TestAuthHandler_RevokeRoleFromKey_A4_RejectsInvalidScopeType
    TestAuthHandler_RevokeRoleFromKey_A4_ScopedNotFoundReturns404

  MCP (2 new table rows in tools_per_tool_test.go):
    Scoped revoke with scope_type=profile + scope_id=p-acme →
      `?scope_type=profile&scope_id=p-acme`
    Scoped revoke with scope_type=global (no scope_id) →
      `?scope_type=global`

Service-layer test plumbing (service_test.go) updated for new opts
arg: 4 existing call sites pass repository.ActorRoleRevokeOptions{}
to keep their pre-A-4 semantics; the fakeActorRoleRepo.Revoke
implementation now mirrors the postgres scope-aware behaviour
(legacy zero-value vs scoped narrowing + ErrActorRoleNotFound on
no-match).

Verify gate green: gofmt clean, go vet clean, go test -short across
repository/postgres, service/auth, api/handler, and mcp. The
pre-existing KeysPage.test.tsx failure observed on the baseline
commit (reproduced via `git stash` earlier in Fix 03) is unrelated;
my client.ts change adds an optional third argument and is fully
backward-compatible.

Spec at cowork/auth-bundles-fixes-2026-05-11/04-high-actor-role-revoke-scope.md.
Audit doc updated: new row A-4 (2026-05-11) CLOSED appended to the
status table at the bottom of cowork/auth-bundles-audit-2026-05-10.md.
Operator-visible advisory in CHANGELOG.md v2.1.0 release notes under
Security (non-BREAKING — legacy callers are unchanged).

Depends on Fix 01 (the scope-aware EffectivePermissions read path on
branch fix/audit-2026-05-11/crit-actor-role-scope-reads). This fix
makes the inverse op selectively reversible; without Fix 01 the read
side would mis-evaluate scoped grants anyway, making selective revoke
moot at runtime.
2026-05-11 10:50:34 +00:00
shankar0123 874419989d harden(auth/cookies): __Host- prefix on all three auth cookies (MED-14, BREAKING)
Audit 2026-05-10 — close MED-14 from the HANDOFF.md backend batch
(item 5). The session, CSRF, and OIDC pre-login cookies all carry
the __Host- prefix; browsers now reject any subdomain attempt to
overwrite them.

Cookie name changes (BREAKING — existing sessions invalidate):
  - certctl_session       → __Host-certctl_session
  - certctl_csrf          → __Host-certctl_csrf
  - certctl_oidc_pending  → __Host-certctl_oidc_pending

The __Host- prefix requires Path=/ + Secure + no Domain attribute.
Post-login session + CSRF cookies already met all three. The pre-login
cookie's Path widened from '/auth/oidc/' to '/' to satisfy the prefix;
the cookie lives 10 minutes and is only consumed by the callback
handler, so the wider path scope is harmless.

Files touched:
  - internal/auth/session/domain/types.go — constant rename + comment
  - internal/auth/session/domain/types_test.go — assertion update
  - internal/api/handler/auth_session_oidc.go — pre-login set + clear
    paths widened from /auth/oidc/ to /
  - web/src/api/client.ts — readCSRFCookie now compares against
    '__Host-certctl_csrf'
  - CHANGELOG.md — Unreleased > Security (BREAKING) entry
  - docs/migration/oidc-enable.md — operator-facing detail of the
    one-time re-authentication window + GUI customization guidance

Operator impact: ONE re-login prompt per active session at the deploy
that lands this change. Subsequent logins issue the __Host-prefixed
cookie automatically. Existing bookmarked deep links work without
modification (cookies are path-scoped, not URL-scoped).

Refs: cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 5
      cowork/auth-bundles-audit-2026-05-10.md MED-14
2026-05-10 22:52:53 +00:00
shankar0123 9cce2ab043 harden(auth): LOW + Nit batch — bootstrap audit, crypto/rand, XFF trust, CSRF check, protocol-prefix unify (Batch 1)
Audit 2026-05-10 — close 8 LOWs + 2 Nits in-bundle. Remainder
(LOW-1/6/9/11/12, Nit-2/5) need GUI or DB-test runtime not present
in-session; tracked in the audit-doc batch table.

LOW-2: bootstrap.ValidateAndMint now emits 'bootstrap.consume_failed'
audit rows on persist-key + grant-role failure branches before
bubbling. Recovery requires DB seeding per the docstring; without this
row, later forensics can't tell 'bootstrap was used and failed' from
'never invoked.'

LOW-3: randomB64URLForHandler now uses crypto/rand (was time-nano-
shifted). Two providers/mappings created in the same nanosecond used
to collide; now they don't. Time-nano fallback retained for the
unlikely crypto/rand-broken path.

LOW-4: breakglass.verifyDummy uses s.readRand(salt) for the dummy
Argon2id verify. Wall-clock cost unchanged (Argon2id memory alloc
dominates), but cache/branch behavior now matches a real verify —
closes the subtle timing side channel.

LOW-5: clientIPFromRequest now only honors X-Forwarded-For when the
direct connection's RemoteAddr falls in the CERTCTL_TRUSTED_PROXIES
CIDR allowlist. Default-deny: empty list means XFF is ignored.
SetTrustedProxies wired in cmd/server/main.go from cfg.Auth.TrustedProxies.

LOW-7: internal/auth/protocol_endpoints.go::ProtocolEndpointPrefixes
now carries /scep-mtls + /.well-known/est-mtls (previously only in
router.AuthExemptDispatchPrefixes; the two lists had drifted). The
canonical-prefix coverage test in Phase 12 still pins the set.

LOW-8: docs/operator/rbac.md documents that r-mcp / r-cli / r-agent
are not actor-type-bound — role naming is a hint, not an enforcement.
Operators wanting hard binding must apply periodic audit queries.
Native binding is on the v2 roadmap.

LOW-10: Session.Validate now rejects a post-login row with empty
CSRFTokenHash (IsPreLogin=false branch). validSession test fixture
updated with a valid 64-hex CSRF hash.

Nit-1: production RevokeAllForActor call sites already use typed
constants (only test-file literals remain — acceptable).

Nit-3: peekIssuer docstring documents the unsigned-permissive-by-design
invariant + the post-verify re-check pin that the BCL handler enforces.
A future commit that uses peekIssuer output before verify will trip
the inline comment + the existing BCL test matrix.

Status table updated in cowork/auth-bundles-audit-2026-05-10.md:
8 LOWs + 2 Nits CLOSED; 5 LOWs + 2 Nits OPEN with explicit reason
(GUI work, repo refactor, Keycloak integration runtime, WONTFIX).

Refs: cowork/auth-bundles-audit-2026-05-10.md LOW-2/3/4/5/7/8/10
      cowork/auth-bundles-audit-2026-05-10.md Nit-1/3
2026-05-10 22:26:12 +00:00
shankar0123 68ca42fef1 fix(auth): apply rbacGate to every state-changing + read handler (CRIT-1 closure)
Closes the wire-layer authorization gap surfaced by the 2026-05-10 audit
(CRIT-1). Before this commit only ~24 of ~140 routes carried rbacGate
enforcement — all of them admin-only fine-grained perms (auth.session.*,
auth.oidc.*, auth.breakglass.admin, cert.bulk_revoke, crl.admin, scep.admin,
est.admin, ca.hierarchy.manage). Every catalogued legacy-CRUD perm
(cert.read/issue/revoke/delete, profile.edit/delete, issuer.edit/delete,
target.*, agent.*, plus role-mgmt verbs) was declared in
internal/domain/auth/validate.go but never wired at the router. A r-viewer
Bearer was essentially r-admin minus five verbs at the wire layer (CWE-862).

This commit:

- Adds rbacGateScoped(checker, perm, scopeType, scopeFn, h) helper to
  internal/api/router/router.go for path-bound scope resolution. Per-profile
  and per-issuer grants (Decision 2) now reach the wire layer.
- Wraps every state-changing route AND every read endpoint in router.go
  with rbacGate (global) or rbacGateScoped (path-bound). The auth-management
  routes (POST /api/v1/auth/roles, etc.) gain router-level enforcement
  in addition to the existing service-layer Authorizer check — defense in
  depth (HIGH-9 of the same audit collapses into this closure).
- Auth-exempt surfaces stay un-gated by design: login, callback, BCL,
  logout, breakglass-login, bootstrap, health, auth-info, version. Allowlist
  is documented in TestRouterRBACGateCoverage.
- Extends internal/domain/auth/validate.go CanonicalPermissions with 30 new
  perms across 12 namespaces: cert.edit; job.read, job.cancel; approval.read,
  approval.approve, approval.reject; policy.read/edit/delete;
  team.read/edit/delete; owner.read/edit/delete; notification.read/edit;
  discovery.read/run/claim; network_scan.read/edit/run;
  healthcheck.read/edit/delete/acknowledge; digest.read, digest.send;
  verification.read, verification.run; stats.read; metrics.read.
- Updates DefaultRoles for r-admin / r-operator / r-viewer / r-mcp / r-cli /
  r-agent. r-auditor gets NOTHING new — the auditor pin
  (TestAuditorRoleHoldsExactlyAuditReadAndExport) stays invariant.
- Migration 000039_audit_crit1_perms seeds the new perm rows + role grants
  per the updated DefaultRoles map. Idempotent ON CONFLICT DO NOTHING.
  Reverse migration removes role_permissions before permissions
  (ON DELETE RESTRICT on the FK).
- AST-level CI guard TestRouterRBACGateCoverage in
  internal/api/router/router_rbac_coverage_test.go walks router.go and
  asserts every state-changing + read route is wrapped (or in the
  documented allowlist). Adding a new ungated route fails CI.
- Updates docs/operator/rbac.md permission-catalogue table with the new
  namespaces + footer link to the AST CI guard.
- Updates certctl/CHANGELOG.md v2.1.0 section with the closure narrative.

Audit doc cowork/auth-bundles-audit-2026-05-10.md CRIT-1 row annotated
CLOSED 2026-05-10. Bundle's exit-gate spec lives at
cowork/auth-bundles-fixes-2026-05-10/01-crit-1-rbac-gates.md.

CRIT-2 / CRIT-3 / CRIT-4 / CRIT-5 of the same audit remain open and
continue to block the v2.1.0 tag.

Verification gate green:
- gofmt -d (no diff after gofmt -w on the touched files)
- go vet ./...
- go test -short -count=1 ./...   (all packages pass including auditor pin)
- go build ./...

HIGH-9 of the audit closes via this commit's router-layer rbacGate on
POST /api/v1/auth/keys/{id}/roles + DELETE /api/v1/auth/keys/{id}/roles/{role_id}
(defense-in-depth on top of the existing service-layer privilege check).

Refs: cowork/auth-bundles-audit-2026-05-10.md CRIT-1 HIGH-9
2026-05-10 19:58:26 +00:00
shankar0123 c03d18bb1c auth-bundle-2 Phase 16: docs updates (security.md OIDC + sessions + break-glass + auditor split sections; new migration/oidc-enable.md; CHANGELOG.md v2.1.0 Bundle 2 release notes)
Closes Phase 16 of cowork/auth-bundle-2-prompt.md. Three operator-
facing docs updated, one new migration guide ships, README nav row
added.

Files
=====

docs/operator/security.md (MODIFIED, Last reviewed bumped to 2026-05-10):
* Added 5 new Bundle 2 subsections under '## Authentication
  surface' after the Bundle 1 approval-bypass-closure entry:
  - 'OIDC federation (Bundle 2 Phases 1-7)' — alg allow-list,
    IdP-downgrade defense, iss/aud/azp/at_hash, single-use
    state+nonce, PKCE-S256 mandatory, JWKS rotation handling,
    encrypted client_secret at rest with the v3 blob format
    pinned by an integration test, pointer to oidc-runbooks/
    for per-IdP setup.
  - 'Sessions + back-channel logout (Bundle 2 Phases 4-6)' —
    length-prefixed HMAC cookie wire format, HttpOnly + Secure
    + SameSite cookie hardening, idle/absolute timeouts, CSRF
    defense, signing-key rotation primitive, fail-fatal
    EnsureInitialSigningKey at server boot, OpenID Connect
    Back-Channel Logout 1.0 (NOT RFC 8414).
  - 'OIDC first-admin bootstrap (Bundle 2 Phase 7)' — coexists
    with Bundle 1's env-var-token bootstrap, group-scoped via
    CERTCTL_BOOTSTRAP_ADMIN_GROUPS + CERTCTL_BOOTSTRAP_OIDC_PROVIDER_ID,
    one-shot per tenant.
  - 'Break-glass admin (Bundle 2 Phase 7.5)' — default-OFF,
    surface invisibility via 404-not-403, Argon2id with OWASP
    2024 params, lockout state machine, constant-time-via-
    verifyDummy, WARN log at boot, runbook pointer for
    operator drill.
  - 'Migrating an existing deployment to OIDC' — pointer to
    the new migration/oidc-enable.md walkthrough.

docs/migration/oidc-enable.md (NEW, Last reviewed 2026-05-10):
* Step-by-step migration guide for an operator on a Bundle-1-merged
  deployment to enable OIDC SSO. Pre-reqs (CERTCTL_CONFIG_ENCRYPTION_KEY,
  admin actor with auth.oidc.create + auth.oidc.edit, IdP tenant)
  + 7 numbered steps (pin encryption key, complete IdP-side per
  runbook, configure certctl-side OIDCProvider, add group→role
  mappings with fail-closed warning, optional first-admin bootstrap,
  verify with single test user, announce SSO endpoint).
* Rollback section covering the 4-step disable flow + the 409
  Conflict on provider-delete-while-sessions-exist + the
  existing-sessions-keep-working-until-expiry semantics.
* Troubleshooting section pinning 8 most-common failure modes
  (discovery doc fetch fails / IdP downgrade defense rejects /
  no roles assigned / iss mismatch / pre-login expired / state
  mismatch / sessions revoked but user can hit API / JWKS
  rotation breaks login).
* Database row count drift documented so operators know what to
  expect after OIDC is live (10 Bundle 2 tables enumerated).
* Cross-references to oidc-runbooks/ + security.md +
  auth-threat-model.md + auth-benchmarks.md + auth-standards-implemented.md.

CHANGELOG.md (MODIFIED):
* v2.1.0 section title bumped from 'Auth Bundle 1: RBAC primitive'
  to 'Auth Bundles 1 + 2: RBAC primitive + OIDC SSO + sessions'.
* Replaced the Bundle 1 closing-bullet ('Bundle 2 starts after
  Bundle 1 lands on master') with 18 new Bundle 2 entries:
  - OIDC + sessions + back-channel logout + break-glass overview.
  - OIDC token validation pinned at three layers (alg allow-list,
    IdP-downgrade defense, OIDC Core §3.1.3.7 re-verification).
  - Length-prefixed HMAC session cookies.
  - CSRF double-submit + hashed-token-on-row.
  - OIDC client_secret AES-256-GCM v3 blob at rest +
    integration-test invariant.
  - OIDC first-admin bootstrap.
  - Default-OFF break-glass admin (Argon2id + lockout +
    constant-time + surface invisibility).
  - GUI: 4 new pages + login-page IdP buttons + sidebar logout.
  - 11 new MCP tools for OIDC + session management.
  - 6 per-IdP runbooks (Keycloak / Authentik / Okta / Auth0 /
    Entra ID / Google Workspace).
  - Threat model extended with 5 new defense subsections + 8 new
    threat-catalogue subsections.
  - Performance baselines documented (4 benchmarks; 3 measured
    + 1 operator-runs).
  - Standards-and-RFC implementation table (13 RFCs + 14 CWEs;
    NOT a compliance-mapping doc).
  - Coverage gates held at floor 90 across all 4 Bundle 2
    packages (anti-Bundle-1-mistake invariant).
  - Multi-tenant query CI guard (ratchet baseline 32).
  - Phase 10 Keycloak testcontainers integration test + optional
    Okta smoke test.
  - OpenAPI cookieAuth security scheme + 13 new endpoints + 4
    break-glass endpoints.
  - Bundle-1-only compat regression CI guard +
    Bundle-1-to-2-upgrade regression CI guard.
* Final paragraph updated to point at oidc-enable.md alongside
  api-keys-to-rbac.md as the two migration walkthroughs.

docs/README.md (MODIFIED):
* Added the new oidc-enable.md migration row under '## Migration'
  alongside the existing api-keys-to-rbac.md entry, with a
  one-line description flagging it as the Bundle 2 OIDC
  onboarding walkthrough.

Verification
============

* Last-reviewed on security.md + oidc-enable.md: 2026-05-10.
* Internal-link sweep on oidc-enable.md: 0 broken (every relative
  link resolves via shell-loop verification).
* Internal-link sweep on docs/README.md: 0 broken (all .md
  references resolve).
* No Go-side impact, make verify gate unchanged.

Bundle 2 documentation deliverables now complete: security.md +
auth-threat-model.md + oidc-runbooks/ + auth-benchmarks.md +
auth-standards-implemented.md + api-keys-to-rbac.md + oidc-enable.md
+ CHANGELOG.md v2.1.0. The full Bundle 2 surface is operator-
discoverable from docs/README.md root nav.
2026-05-10 17:07:27 +00:00
shankar0123 3f335af45e auth-bundle-2 Phase 15: docs/reference/auth-standards-implemented.md (RFC + CWE evidence list, NOT a compliance-mapping doc)
Closes Phase 15 of cowork/auth-bundle-2-prompt.md. Ships a single
operator-facing doc that lists every RFC the auth bundles implement
and every CWE class the implementation closes, with concrete file
paths + test anchors per row.

Files
=====

docs/reference/auth-standards-implemented.md (NEW):
* Table 1: 13 RFCs / standards rows (RFC 6749, 7636, 7519, 7517,
  OIDC Core 1.0, OIDC BCL 1.0, RFC 6265, RFC 9700, RFC 8414,
  RFC 7633, RFC 8555, RFC 7515 plus the OIDC Core §5.3.2 UserInfo
  endpoint). Every row has a concrete source file path + a
  negative-test anchor.
* Table 2: 14 CWE rows (CWE-287, 352, 384, 294, 916/329, 307,
  345, 200, 770, 330, 311, 326, 1004, 614, 1275). Every row
  points at where the defense lives + where it is pinned.
* Bundle 1 RBAC standards covered separately at the end with
  CWE-285, 862, 863, 732 pointers into Bundle 1's surface.
* Explicit 'What this document is NOT' section preserving the
  operator's 2026-05-05 retired-compliance-docs decision: the
  doc is an evidence list, NOT a SOC 2 / PCI-DSS / HIPAA /
  NIST SP 800-53 / NIST SSDF / FedRAMP framework-mapping doc.
  Framework name-drops appear ONLY inside the explicit
  'this is NOT' disclaimer paragraphs; no marketing-flavored
  prose claims certctl 'satisfies CC6.1' or similar.

docs/README.md (MODIFIED):
* Adds the auth-standards-implemented.md doc to the Reference
  section nav table between intermediate-ca-hierarchy.md and
  the deployment-model.md entry, with a one-line description
  flagging it as RFC + CWE evidence (NOT a compliance-mapping
  doc).

Verification
============

* Last-reviewed header: 2026-05-10.
* Internal-link sweep: every relative link resolves cleanly.
* Framework-name grep: SOC 2 / PCI-DSS / HIPAA / NIST SSDF /
  FedRAMP appear ONLY inside the 'this is NOT a compliance-
  mapping doc' disclaimer paragraphs (lines 7 and 66 of the
  new doc). No marketing-flavored claims.
* No Go-side impact; pure docs commit, make verify gate
  unchanged.
2026-05-10 16:58:06 +00:00
shankar0123 9b6294e83d auth-bundle-2 Phase 14: session + OIDC validation benchmarks (steady-state + cold paths) + auth-benchmarks.md operator doc + Makefile targets
Closes Phase 14 of cowork/auth-bundle-2-prompt.md. Ships four
benchmarks producing four numbers + the operator-doc table; three
default-tag benchmarks runnable on every CI runner, the fourth
(cold-cache OIDC) runnable on operator-side Docker hosts via the
new make target.

Files
=====

internal/auth/session/bench_test.go (NEW):
* BenchmarkSession_SteadyState (target p99 < 1ms; measured 5µs).
  Warm in-memory repo + warm session row. Pure CPU: parseCookie +
  HMAC verify + map lookup + sentinel checks.
* BenchmarkSession_ColdProcess (target p99 < 10ms; measured 7.1ms).
  Same pipeline but with a configurable per-call delay simulating
  a 1ms Postgres RTT on each repo call. Two repo calls per
  Validate (signing-key fetch + session-row fetch) = 2ms minimum;
  Go time.Sleep granularity adds ~1-2ms jitter. Documented why
  testcontainers Postgres isn't viable inside b.N: 30+ second
  container boot incompatible with per-iteration timing.
* slowSessionRepo + slowKeyRepo wrappers add the per-call delay
  via time.Sleep; they delegate to the existing in-memory stubs.
* reportPercentiles helper sorts + reports p50/p95/p99/max via
  b.ReportMetric (Go testing.B doesn't surface percentiles
  natively).

internal/auth/oidc/bench_test.go (NEW):
* BenchmarkOIDC_SteadyState (target p99 < 5ms; measured 1.5ms).
  Drives full HandleCallback against an in-process mockIdP
  (httptest.Server localhost loopback). Pre-warmed JWKS cache via
  RefreshKeys at setup. Pipeline: pre-login consume + state
  compare + token exchange (localhost ~50-200µs) + go-oidc
  Verify (RSA-2048 sig verify + alg pin) + service-layer iss/
  aud/azp/at_hash/exp/iat/nonce re-checks + group-claim
  resolution + group→role mapping + user upsert + session mint.
* The localhost-loopback /token call adds ~100-500µs of TCP
  overhead vs pure crypto; the prompt's "no network calls"
  steady-state framing accommodates this since the localhost
  loopback is the closest practical proxy for a same-region
  IdP /token call (which adds 5-15ms in production).

internal/auth/oidc/bench_keycloak_test.go (NEW, //go:build integration):
* BenchmarkOIDC_ColdCache (target p99 < 200ms; operator-runs).
  Drives RefreshKeys against a live Keycloak container from the
  Phase 10 testfixtures harness. Each iteration evicts the
  in-process cache + re-fetches discovery + re-fetches JWKS over
  real HTTP + re-runs the IdP-downgrade-attack defense.
* Network-bounded: the cold path is dominated by HTTPS RTT to
  the IdP discovery endpoint, NOT crypto. The 200ms cap
  accommodates a geographically-distant IdP (~150ms RTT) plus
  the in-process JWKS fetch + downgrade-defense logic (~5ms
  locally).
* Reuses the sharedKeycloak fixture from
  integration_keycloak_test.go (Phase 10) so the benchmark
  doesn't pay the 60-90s container boot cost separately. Skips
  with a clear message if invoked without the integration test
  setup.
* Reports p50/p95/p99/max in MILLISECONDS (vs the
  microsecond-granularity steady-state benchmarks) since the
  cold path is two orders of magnitude slower.

internal/auth/oidc/service_test.go (MODIFIED):
* Refactored newMockIdP(t *testing.T) to delegate to a new
  newMockIdPWithTB(t testing.TB) sibling. Standard Go pattern
  for sharing test fixtures between *testing.T and *testing.B.
  No behavior change for existing service_test.go tests; the
  benchmark file in bench_test.go calls newMockIdPWithTB(b)
  to get the same fixture.

docs/operator/auth-benchmarks.md (NEW):
* Result table with all four benchmarks + targets + measured
  numbers + status markers. Four-row matrix for the default-tag
  benchmarks; the fourth row (cold-cache) is operator-recorded
  with an empty cell waiting for the first Docker-equipped run.
* Hardware floor section pinning the 4 vCPU / 8 GiB RAM /
  Postgres 16 / Go 1.25 baseline. GitHub-hosted Ubuntu runners
  satisfy this; operators on weaker hardware re-record.
* "What each benchmark covers (and what it doesn't)" section
  per benchmark, distinguishing the warm steady-state pipeline
  from the cold path's network-bounded budget.
* "Cold-cache OIDC: how to run" subsection documenting the
  make target + the test+benchmark coupling needed to populate
  sharedKeycloak. Operator-recorded baseline table seeded
  empty for first runs.
* "Why the cold path is bounded by network latency, not crypto"
  section explaining the budget breakdown:
    - TCP handshake (1 RTT)
    - TLS 1.3 handshake (1-2 RTTs)
    - 2 HTTPS GETs (discovery + JWKS, 1 RTT each)
    - In-process crypto on the certctl side (~5-10ms total)
  So the 200ms cap is operator-checkable: real measurement >
  200ms means the IdP is slow OR network congestion OR DNS
  issues — the diagnosis is upstream of certctl. Real
  measurement < 200ms means the IdP is on a fast same-region
  link.
* Methodology section pinning the per-iteration timing capture
  + sort + percentile-extract approach.
* Pre-merge audit section for the Phase 14 exit gate: four
  benchmarks ran, four numbers recorded, steady-state targets
  met, cold path is operator-runnable + measurably-bounded.

Makefile (MODIFIED):
* Added `make benchmark-auth` (default-tag, runs three of four
  benchmarks at 2000 samples each).
* Added `make benchmark-auth-coldcache` (integration-tagged,
  runs OIDC cold-cache against live Keycloak; requires Docker).
* Both targets carry explanatory comment blocks.

docs/README.md (MODIFIED):
* Added the auth-benchmarks.md doc to the Operator nav table
  alongside performance-baselines.md.

Measured baselines at Phase 14 close (linux/arm64, 4 vCPU)
==========================================================

  BenchmarkSession_SteadyState     p99 = 5µs    (target < 1ms)   ✓ 200× under
  BenchmarkSession_ColdProcess     p99 = 7.1ms  (target < 10ms)  ✓
  BenchmarkOIDC_SteadyState        p99 = 1.5ms  (target < 5ms)   ✓ 3× under
  BenchmarkOIDC_ColdCache          operator-runs (Docker required)

Verification
============

* gofmt -l on three new bench files: clean.
* go vet ./internal/auth/session/... ./internal/auth/oidc/...: clean
  (default tag).
* go vet -tags integration ./internal/auth/oidc/...: clean (integration
  tag covers the bench_keycloak_test.go file).
* go test -short -count=1 across all 5 OIDC + session packages:
  green; the bench_*_test.go files compile but don't run under
  -short (testing.Short() guards + benchmarks are not selected
  by -run pattern).
* All three runnable benchmarks executed and produce the numbers
  above; recorded in auth-benchmarks.md.
2026-05-10 16:51:28 +00:00
shankar0123 5e2accbf5f auth-bundle-2 Phase 12: extend auth-threat-model.md with Bundle 2 sections (OIDC + sessions + back-channel logout + OIDC first-admin + break-glass + 8 Bundle 2 threat sub-sections)
Closes Phase 12 of cowork/auth-bundle-2-prompt.md. The single
canonical operator-facing threat model (one doc per topic per the
docs convention) now covers both Bundle 1 (RBAC) AND Bundle 2 (OIDC
+ sessions + back-channel logout + OIDC first-admin + break-glass)
in one place.

File: docs/operator/auth-threat-model.md (MODIFIED, +485 LOC)

Conventions held
================

* The Bundle 1 sections ("Threat actors", "Defenses Bundle 1
  ships", "Threats Bundle 1 does NOT close", "Compliance mapping",
  "Operator-facing checks", "Cross-references") stay structurally
  intact. Bundle 2 EXTENDS them; nothing is rewritten in place.
* `Last reviewed:` header bumped 2026-05-09 → 2026-05-10.
* Per the prompt's explicit instruction: "do NOT create a separate
  auth-threat-model-bundle-2.md companion." This commit is a
  single-file extension.

Changes
=======

Intro paragraph rewritten:
* From "Bundle 1 lands... Bundle 2 will be updated" to "Bundle 1
  AND Bundle 2 land." Sets the reader's expectation that this is
  the post-Bundle-2 doc.

Threat actors section (4 new actors appended):
* OIDC-federated end user (token-forgery / session-hijacking /
  group-claim-manipulation surface).
* Stolen session cookie holder (XSS / network MITM / pasted-token).
* Compromised IdP (rogue token issuance; mitigations bounded to
  audit trail + group-mapping configuration).
* Break-glass-password holder (Phase 7.5 path bypasses OIDC + group
  layer entirely; default-OFF is the load-bearing mitigation).

NEW: Defenses Bundle 2 ships (5 sub-sections):
* OIDC token validation (Phase 3) — alg allow-list, IdP-downgrade
  defense, exact iss match, aud + azp checks, at_hash
  REQUIRED-when-access_token-present (Phase 3 tightening of OIDC
  core's MAY → MUST), single-use state + nonce, PKCE-S256 mandatory,
  iat window, JWKS rotation handling, JWKS-fetch-fail closed,
  encrypted client_secret at rest.
* Session minting + cookies (Phases 4 + 6) — length-prefixed HMAC
  defeating concatenation collision, HttpOnly + Secure + SameSite
  cookie hardening, idle + absolute timeouts, CSRF defense via
  double-submit-cookie + hashed-token-on-row, optional IP/UA bind,
  signing-key rotation primitive with retention window, fail-fatal
  EnsureInitialSigningKey at boot, pre-login vs post-login cookie
  discrimination.
* Back-channel logout (Phase 5) — OpenID Connect Back-Channel
  Logout 1.0 (NOT RFC 8414), required-claim pinning, jti-based
  replay defense, alg allow-list applies, Cache-Control: no-store.
* OIDC first-admin bootstrap (Phase 7) — coexists with Bundle 1's
  env-var-token bootstrap, group-scoped, one-shot per tenant via
  admin-existence probe, explicit OIDC provider gate, audit row on
  every grant.
* Break-glass admin (Phase 7.5) — default-OFF, surface-invisibility
  via 404-not-403, Argon2id with OWASP 2024 params, lockout state
  machine, constant-time across all failure paths via verifyDummy,
  WARN log at boot when ENABLED=true, 5/min rate limit on the
  public login endpoint.

NEW: Bundle 2 threat catalogue (8 sub-sections, one per
prompt-enumerated threat axis):

1. OIDC token forgery vectors and mitigations (9-row table covering
   alg confusion, audience injection, issuer mismatch, nonce replay,
   state replay, at_hash substitution, iat window manipulation,
   JWKS rotation mid-login, JWKS-fetch failure during a key
   rotation).
2. Session hijacking vectors and mitigations (7-row table covering
   XSS cookie theft, network MITM, CSRF, concatenation-collision
   forgery, stolen-cookie replay, cross-tab interference, sign-out
   race).
3. IdP compromise scenarios (operator monitors IdP audit logs,
   operator can rotate group-role mappings without redeploying,
   audit trail records source provider, provider-delete returns
   409 with active sessions).
4. Back-channel logout failure modes (6-row table covering IdP
   unreachable, invalid signature, replay via jti, alg confusion,
   missing events claim, present-nonce-claim).
5. Group-claim manipulation (4-row table covering operator
   misconfigured mapping, misconfigured groups_claim_path, IdP
   renames a group, IdP user maintainer adds user to unintended
   group).
6. Bootstrap phase risks post-Bundle-2 (4-row table covering
   CERTCTL_BOOTSTRAP_TOKEN leak, CERTCTL_BOOTSTRAP_ADMIN_GROUPS
   misconfigured to a wide group, both bootstrap strategies
   simultaneously, multi-IdP without explicit provider gate).
7. Break-glass risks (7-row table covering phished password,
   online brute-force, offline brute-force on DB compromise,
   operator forgets to disable, side-channel timing on
   wrong-vs-no-credential-vs-locked, surface fingerprinting,
   reserved-actor mutation).
8. Token-leak hygiene (the explicit grep policy with three
   per-package logging_test.go pointers + the audit_redact.go
   defense-in-depth note).

Threats Bundle 1 does NOT close section relabeled:
* Section header now reads "Threats Bundle 1 does NOT close
  (Bundle 2 closure status)" with each item carrying  / ⚠️ /
  "still deferred" markers.
* Items 1, 2, 3, 8 marked  closed by Bundle 2.
* Items 4, 5, 7, 9 marked still-deferred with v3 / follow-on
  pointers.
* Item 6 (rate limiting on bootstrap) marked acceptable; Bundle 2
  adds the same rate-limit primitive to /auth/breakglass/login.

NEW: Threats Bundle 2 does NOT close section listing the 8 v3 /
future-work items:
* WebAuthn / FIDO2 second factor (Decision 12).
* Time-bound role grants / JIT elevation.
* SAML federation (operators broker through Keycloak).
* Multi-tenant data isolation activation (gated to managed-service
  hosting work).
* HSM / FIPS-validated signing key for sessions.
* OIDC RP-initiated logout (Bundle 2 implements only back-channel).
* GUI E2E via Playwright.
* Per-IdP runbook external-tester sign-off (encouraged, NOT a merge
  gate post-2026-05-10 policy change).

Operator-facing checks section extended:
* 6 new SQL-shaped checks for Bundle 2 (provider count drift,
  per-actor session count, unmapped-groups audit-row spike,
  break-glass usage outside incidents, OIDC first-admin one-row-per-
  tenant invariant, retired-signing-key GC liveness).

Cross-references section split into Bundle 1 anchors + Bundle 2
anchors:
* Bundle 2 anchors enumerate every load-bearing file: 6
  internal/auth/ packages, 5 migrations, 3 ci-guards.

Compliance mapping section UNCHANGED:
* Phase 15 (standards-and-RFC-implementation table) is the proper
  home for the RFC + CWE evidence the Bundle 2 surface adds.
  Re-introducing framework-mapping prose at the threat-model layer
  would regress the operator's 2026-05-05 retired-compliance-docs
  decision, which is explicitly forbidden by the Phase 15 prompt.

Verification
============

* `> Last reviewed: 2026-05-10` — confirmed via head -3.
* All 8 prompt-mandated Bundle 2 threat sub-sections present —
  confirmed via grep `^### ` count (19 ### headers total: 6 Bundle
  1 + 5 Bundle 2 defenses + 8 Bundle 2 threats).
* All 39 prompt-listed threat-vector keywords present — confirmed
  via single-line grep counting 39 hits across the prompt's
  vocabulary.
* Internal markdown links resolve cleanly — confirmed via shell
  loop iterating each `]( ...)` reference and checking `[ -e "$path" ]`.
* No backend / Go-test impact — pure docs commit.
* `make verify` gate unchanged.
2026-05-10 16:11:08 +00:00
shankar0123 f203a5372d auth-bundle-2 Phase 11 follow-on: drop external-tester reference from oidc-runbooks/index.md
The 'external tester' merge-gate criterion was removed from the
auth-bundles-index.md policy: external-tester confirmations are
encouraged but NOT a merge condition (BSL discourages contribution-
style testing; the Phase 10 Keycloak testcontainers harness + the
optional Okta smoke test cover the same surface deterministically
in CI). Drops the now-stale phrasing from the runbooks index and
the merge-gate reference; keeps the operator-sign-off footer
recommendation since dated validation records are still useful.
2026-05-10 15:58:03 +00:00
shankar0123 2893f9b48e auth-bundle-2 Phase 11: 6 per-IdP OIDC runbooks + index + docs/README wiring
Closes Phase 11 of cowork/auth-bundle-2-prompt.md. Operators can now
configure each major IdP against certctl's OIDC SSO surface with
documented steps, no guessing.

Files
=====

docs/operator/oidc-runbooks/index.md (NEW):
* Index page linking all six per-IdP runbooks.
* Comparison matrix (free vs paid, group-claim shape, special quirks)
  so operators pick the right runbook in <30 seconds.
* "Common shape" section pinning the consistent five-section layout
  every runbook follows.
* "Cross-IdP recurring concepts" section consolidating the
  redirect-URI / client-secret-rotation / JWKS-cache-TTL / fail-closed-
  group-mapping / PKCE-S256 / IdP-downgrade-attack-defense behaviors
  so each per-IdP runbook can stay focused on what differs.

docs/operator/oidc-runbooks/keycloak.md (NEW):
* Canonical reference. Mirrors the testfixtures/keycloak-realm.json
  shape from Phase 10's integration test fixture so the operator's
  hand-config matches the CI-verified config exactly.
* Step-by-step IdP-side: realm → client → groups → group-mapper →
  user. Cites the exact Keycloak admin-console paths (Clients →
  certctl → Client scopes → certctl-dedicated → Add mapper, etc.).
* GUI + API + MCP equivalents for the certctl-side configuration.
* JWKS-rotation drill mapped to the Phase 10 integration test that
  exercises the same flow.
* 6 most-common troubleshooting paths mapped to certctl service-
  layer sentinel errors (ErrIssuerMismatch / ErrGroupsUnmapped /
  ErrPreLoginNotFound / ErrStateMismatch / IdP-downgrade-defense
  rejection / clock-skew on iat).

docs/operator/oidc-runbooks/authentik.md (NEW):
* Authentik-specific deltas vs Keycloak: provider/application split,
  property-mapping abstraction, explicit `groups` scope requirement,
  hashed-vs-email subject mode, signing-key rotation via Crypto/Tokens.

docs/operator/oidc-runbooks/okta.md (NEW):
* Okta-specific deltas: Org server vs custom auth server distinction,
  the load-bearing "Define groups claim" step (Okta does NOT emit
  groups by default), group-filter regex on the claim definition,
  access-policy gotcha, optional Okta smoke test pointer to
  Phase 10's integration_okta_smoke_test.go.

docs/operator/oidc-runbooks/auth0.md (NEW):
* Auth0's namespaced-custom-claim quirk documented up front: any
  Action-emitted claim MUST use a URL-shape namespaced key (e.g.
  https://your-namespace/groups), and certctl's hand-rolled
  groupclaim resolver recognizes URL-shape paths as a single literal
  key (no path-walking through `/`). Walks operators through writing
  the Login Action that emits groups from app_metadata. Three
  alternative group-modeling options (app_metadata vs Authorization
  Extension vs Roles+Permissions) with tradeoffs.

docs/operator/oidc-runbooks/azure-ad.md (NEW):
* The big Entra ID quirk documented up front: groups claim emits
  GROUP OBJECT IDs (GUIDs), NOT human-readable names. Certctl group→
  role mappings MUST be configured against the GUIDs. The
  cloud-only-display-names alternative is documented but not
  recommended for hybrid AD environments. Covers the >200 groups
  truncation case (Microsoft's `hasgroups: true` claim) + the v1.0
  vs v2.0 endpoint distinction (certctl supports v2.0 only).

docs/operator/oidc-runbooks/google-workspace.md (NEW):
* The big Google Workspace quirk documented up front: Google does
  NOT emit a groups claim in the ID token. Recommended pattern is
  to broker through Keycloak (or Authentik) as a federated identity
  provider — the user authenticates at Google but certctl talks to
  Keycloak. Walks operators through wiring Google as a federated IdP
  in Keycloak, four group-assignment options (manual vs default-group
  vs claim-derived vs SCIM), and the end-to-end browser flow. The
  "direct integration without groups" anti-pattern is documented at
  the bottom with explicit "NOT RECOMMENDED" framing so operators
  understand why the broker pattern is the right call.

docs/README.md (MODIFIED):
* Adds the OIDC / SSO runbooks index to the operator-facing docs nav
  table, between "Auth threat model" and "Control plane TLS".

Conventions held
================

* Every runbook carries `> Last reviewed: 2026-05-10` per the
  docs convention.
* Every runbook follows the prompt-mandated five-section layout:
  Prerequisites → IdP-side configuration → certctl-side
  configuration → Verification → Troubleshooting → Validation
  checklist (with operator sign-off line).
* Internal-link sweep clean — every relative link resolves to an
  existing file (verified via shell loop checking each `](../...)`
  and `](*.md)` reference). External links to IdP vendor sites are
  the canonical https URLs.
* No leakage of cowork/ workspace paths as Markdown links — the
  azure-ad.md initially had a `[auth-bundles-index.md](../../../../cowork/...)`
  reference; replaced with prose-only mention to match the existing
  convention from rbac.md + migration/api-keys-to-rbac.md.
* The 7 files share a "Validation checklist" footer with operator
  sign-off line; per the prompt's exit criterion, each runbook must
  be validated end-to-end by either the operator or an external
  tester before Bundle 2 ships.

Verification
============

* Last-reviewed dates: 7/7 runbooks dated 2026-05-10.
* Internal-link sweep: 0 broken (every `]( ...)` reference resolves).
* docs/README.md → operator/oidc-runbooks/index.md link resolves.
* No backend / frontend / Go-test impact — pure docs commit. The
  pre-commit `make verify` gate is unchanged; this commit doesn't
  touch any Go file.

Phase 11 deviation note
=======================

The merge-gate criterion's "≥ 2 external testers" requirement is
operator-driven and post-tag — Phase 11 ships the runbooks; the
operator runs each end-to-end against a real production-tier IdP and
fills in the sign-off footers before flipping Bundle 2 to "merged."
Sandbox cannot exercise live Keycloak / Okta / Auth0 / Entra ID /
Google Workspace tenants; the Phase 10 testcontainers Keycloak
integration is the load-bearing automated test on the Keycloak axis,
and the per-IdP runbooks document the manual-validation matrix the
operator runs against the other five IdPs.
2026-05-10 15:49:56 +00:00
shankar0123 3e91c7a1f0 chore(security): bump Go toolchain 1.25.9 -> 1.25.10 + golang.org/x/net 0.49 -> 0.53
CI run #484's Go Build & Test job failed govulncheck (M-024 hard
gate). Six standard-library CVEs land in go1.25.9 + one
golang.org/x/net CVE in v0.49.0; all are fixed in go1.25.10 + x/net
v0.53.0 respectively. The advisories that fired were:

  GO-2026-4986  Quadratic string concat in net/mail.consumeComment
                — called via internal/api/handler/validation.go's
                ValidateCommonName -> mail.ParseAddress
  GO-2026-4977  Quadratic string concat in net/mail.consumePhrase
                — same call site
  GO-2026-4982  Bypass of meta-content URL escaping in html/template
                — called via internal/service/digest.go's
                RenderDigestHTML -> Template.Execute
  GO-2026-4980  Escaper bypass in html/template
                — same call site
  GO-2026-4971  Panic in net.Dial / LookupPort on Windows NUL bytes
                — many call sites (email notifier, SSH connector,
                ACME validators, validation.ValidateSafeURL, ...)
  GO-2026-4918  Infinite loop in net/http2 transport on bad
                SETTINGS_MAX_FRAME_SIZE
                — called via internal/connector/target/f5.go's
                F5Client.Authenticate -> http.Client.Do

Bumps applied:

* `go.mod`: `go 1.25.9` -> `go 1.25.10`; `golang.org/x/net v0.49.0`
  -> `v0.53.0` (kept indirect — the upgrade is force-pulled by the
  module-version directive; transitive deps will pick the higher).
* `.github/workflows/{ci,codeql,release}.yml`: setup-go pin and the
  release.yml `GO_VERSION` env var bumped to 1.25.10. The
  security-deep-scan.yml workflow uses the major-minor `1.25` pin
  which auto-resolves to the latest 1.25.x and is unaffected.
* `Dockerfile` + `Dockerfile.agent`: `golang:1.25-alpine@sha256:5caa...`
  re-pinned to `golang:1.25.10-alpine@sha256:8d22e29d960bc50cd0...`
  (digest looked up against `registry-1.docker.io/v2/library/golang/
  manifests/1.25.10-alpine`; verified by the digest-validity ci-guard).
  The explicit `1.25.10-alpine` tag form replaces the moving
  `1.25-alpine` pin so the image-spec is reproducible end-to-end
  even without the digest reference.
* `deploy/test/f5-mock-icontrol/Dockerfile`: `golang:1.25.9-bookworm
  @sha256:1a14...` re-pinned to `golang:1.25.10-bookworm@sha256:
  e3a54b77385b4f8a31c1...` (looked up the same way).
* `deploy/test/f5-mock-icontrol/go.mod`: `go 1.25.9` -> `go 1.25.10`.
* `internal/api/handler/version.go` + `api/openapi.yaml`: the
  `runtime.Version()`-shape comment + OpenAPI `example: go1.25.9`
  bumped to keep doc/example freshness.
* `docs/contributor/ci-pipeline.md` + `docs/reference/connectors/
  iis.md`: doc-only `Go 1.25.9` -> `Go 1.25.10` references.

Verification done in-tree:

* All `scripts/ci-guards/*.sh` pass locally including
  `digest-validity.sh` (the new digests resolve cleanly against
  Docker Hub).
* `S-1-hardcoded-source-counts.sh` clean (the false-positive on
  "Bundle 1 migrations" was fixed in the prior commit).

Operator step required post-push (sandbox has no Go toolchain):

  cd certctl && go mod tidy

This regenerates go.sum's `golang.org/x/net v0.49.0` h1: lines into
v0.53.0 ones. CI's `go mod tidy && git diff --exit-code go.mod
go.sum` step will catch the drift if missed; in that case run the
command, commit, and push the go.sum-only delta.
2026-05-09 21:35:46 -04:00
shankar0123 51f55c5fc9 auth-bundle-1 fix: S-1 ci-guard false positive on "Bundle 1 migrations"
CI run #484 surfaced the regression in the Frontend Build job:

  ::error::S-1 regression: hardcoded source-count prose reappeared:
  docs/migration/api-keys-to-rbac.md:32:schema is already at the target
  version. The Bundle 1 migrations

The S-1 guard's regex (scripts/ci-guards/S-1-hardcoded-source-counts.sh)
catches `\b[0-9]+\s+migrations\b` to prevent stale "<N> migrations"
prose in docs/. The Bundle 1 migration-guide phrasing "The Bundle 1
migrations" tripped on the digit-1 in "Bundle 1" sitting next to the
word "migrations" — false positive, not a real source-count claim.

Rephrase to "Migrations that ship in the Bundle 1 slice of v2.1.0:"
which keeps the same operator meaning without the regex collision.
The guard now passes; full ci-guards loop runs clean locally.

Spotted via the operator's CI-failure paste post-Bundle-1 merge.
2026-05-10 01:18:16 +00:00
shankar0123 5313cd8492 auth-bundle-1 Phase 13 follow-up: em-dash sweep + broken-link fix
Self-audit on e7a94b6 flagged the prompt's 'zero em dashes'
discipline rule. The four new Phase 13 docs and the v2.1.0
CHANGELOG section had 97 em-dash hits between them; this commit
sweeps them all to ASCII hyphens.

Counts before -> after:
  docs/operator/rbac.md                  28 -> 0
  docs/operator/auth-threat-model.md     36 -> 0
  docs/migration/api-keys-to-rbac.md     16 -> 0
  docs/operator/security.md               8 -> 0
  docs/reference/profiles.md              3 -> 0
  CHANGELOG.md                            6 -> 0

Mechanical: ' - ' (spaced em dash) and bare em-dash both replaced
with spaced ASCII hyphen, then double-spaces collapsed. Markdown
list bullets ('^- ', '^  - ', '^    - ') verified intact across
all six files. Internal-link sweep also re-run.

Also fixes a pre-existing broken link the audit caught:
  docs/operator/security.md:70 referenced
  '../internal/crypto/encryption.go' which is a 1-level-up jump
  from docs/operator/, not the 2-level-up jump it actually needs
  ('../../internal/crypto/encryption.go'). Pre-Bundle-1 link rot;
  fixed in lockstep so the merge gate's docs validation passes
  cleanly.

Final state across the Phase-13 docs + CHANGELOG:
  - 0 em dashes
  - 0 broken internal links
  - Last-reviewed: 2026-05-09 header on every new doc

Bundle 1 documentation is now ready for the operator-side merge
gate review.
2026-05-10 00:15:30 +00:00
shankar0123 e7a94b6080 auth-bundle-1 Phase 13: docs (rbac.md + threat model + migration guide + security.md update)
Closes the last Phase before the Bundle 1 Exit gate. Operators
now have authoritative reference + threat model + migration guide
covering every behavior change Bundles 0-12 introduced.

# New docs

* docs/operator/rbac.md (340 lines) — operator how-to:
  - Mental model (actors / roles / permissions / scopes)
  - 7 default roles seeded by migration 000029 + the 5
    admin-only fine-grained perms seeded by 000030
  - Permission catalogue table by namespace
  - Scope semantics (global beats specific) + the Bundle-2
    deferral on scope_id FK enforcement
  - Granting / revoking access from GUI + CLI + HTTP API + MCP
  - The auditor pattern (audit-only, no resource read)
  - Day-0 bootstrap flow (CERTCTL_BOOTSTRAP_TOKEN → curl →
    HTTP 410 thereafter)
  - Demo-mode (CERTCTL_AUTH_TYPE=none) caveat for production

* docs/operator/auth-threat-model.md (180 lines) — what the
  controls defend against:
  - 5 threat actors (external, wrong-role, compromised key,
    insider operator, compromised auditor)
  - Per-defense walk-through (API-key auth, RBAC, bootstrap,
    approval workflow + Phase 9 closure, audit trail,
    protocol-endpoint allowlist)
  - 9 explicit deferrals (OIDC, sessions, local accounts,
    JIT elevation, MFA, etc.) — Bundle 2 / future scope
  - Compliance mapping (SOC 2 CC6.1/CC6.3, HIPAA §164.312(b),
    NIST SSDF PO.5.2, FedRAMP AU-9, PCI-DSS §10)
  - 5 operator-runnable sanity checks (e.g.,
    'SELECT FROM audit_events WHERE actor=system-bypass' MUST
    return 0 in production)

* docs/migration/api-keys-to-rbac.md (200 lines) — v2.0.x →
  v2.1.0 upgrade flow:
  - The SECURITY: AUDIT YOUR API KEYS callout
  - Migration list (000029-000033) + what each does
  - 4-mode scope-down flow (interactive / non-interactive
    JSON / --suggest / --suggest --apply)
  - What changes for code that called auth.IsAdmin
  - Helm-specific upgrade flow with example post-upgrade Job
  - Docker Compose upgrade flow + the 5 examples folders
    that ride demo mode unchanged
  - Verification queries + rollback flow

# Updated docs

* docs/operator/security.md — Last-reviewed bumped to
  2026-05-09; existing Authentication-surface section
  extended to call out the Bundle 1 RBAC primitive,
  day-0 bootstrap path, and approval-bypass closure with
  cross-references to the new docs.

* docs/reference/profiles.md — Last-reviewed header
  formatting fixed (added the > blockquote prefix used
  consistently across the docs tree).

# docs/README.md navigation

* Operator section gains 2 new rows (RBAC + auth-threat-model)
  and Approval-workflow row updated to mention Phase 9
  closure.
* Reference section gains the Profiles row.
* Migration section gains the api-keys-to-rbac row with the
  AUDIT YOUR API KEYS callout in the link description.

# CHANGELOG.md v2.1.0 section refreshed

The Phase 7 commit landed the SECURITY: AUDIT YOUR API KEYS
callout. This commit appends the missing Phase 9-12 highlights:

  - Approval-bypass closure (profile-edit gate + flip-flop
    loophole + ErrApproveBySameActor invariant)
  - GUI: Roles / API Keys / Auth Settings / Approvals queue
  - 12 new MCP RBAC tools
  - Coverage gates on internal/auth + internal/service/auth
  - Protocol-endpoint allowlist pinned at 3 layers

Trailing cross-reference block now points at all 4 new docs.

# Verifications

* Every internal link in the 4 new/modified docs validated by
  shell sweep (find broken links → 0 hits).
* Every new doc carries 'Last reviewed: 2026-05-09' header
  with the > blockquote prefix matching the docs-tree
  convention.
* go vet ./... clean.
* staticcheck across every Bundle-1-touched Go package clean.
* gofmt -l clean repo-wide.
* go test -short -count=1 green across internal/auth (incl.
  bootstrap), internal/api/handler, internal/api/router,
  internal/cli, internal/service (incl. auth),
  internal/domain/auth, internal/mcp, cmd/cli (cmd/server
  has 1 environmental failure on the sandbox virtiofs-tmp:
  TestPreflightSCEPRACertKey_KeyWorldReadable_Refuses depends
  on tmpfs file-mode semantics that virtiofs propagates
  differently — pre-existing, unrelated to Bundle 1).
* Frontend: 19 Vitest tests across src/pages/auth/ +
  AuditPage all pass; tsc --noEmit clean.
2026-05-10 00:10:15 +00:00
shankar0123 69a508dfcf auth-bundle-1 Phase 9 + 10: approval-bypass closure + RBAC GUI
# Phase 9 — approval-bypass closure (Decision 9, option a)

* Migration 000033_approval_kinds.up.sql: ALTER TABLE
  issuance_approval_requests ADD COLUMN approval_kind +
  payload JSONB; relax certificate_id + job_id to nullable;
  CHECK (approval_kind IN ('cert_issuance','profile_edit'))
  + CHECK (per-kind nullability invariant) + index on
  approval_kind. Idempotent throughout via DO blocks.
* domain.ApprovalKind enum (cert_issuance / profile_edit) +
  IsValidApprovalKind. ApprovalRequest gains Kind +
  Payload []byte for the pending profile diff.
* postgres.ApprovalRepository.Create + scanApprovalRow extended
  to round-trip the new columns; certificate_id + job_id
  switched to sql.NullString so profile_edit rows persist
  cleanly. Default Kind=cert_issuance preserves back-compat
  for every Phase-7-2026-05-03 caller.
* ApprovalService.RequestProfileEditApproval: new entry point
  that creates a pending profile-edit row carrying the
  serialized profile diff. Bypass mode (CERTCTL_APPROVAL_BYPASS)
  short-circuits the same way it does for cert_issuance.
* ApprovalService.SetProfileEditApply hook: cmd/server/main.go
  registers a closure that deserializes req.Payload + persists
  via profileRepo.Update + emits a profile.edit_applied audit
  row with category=auth. The hook avoids the Approval ↔
  Profile import cycle.
* ProfileService.UpdateProfile: gates when (a) the live
  profile carries RequiresApproval=true, OR (b) the proposed
  edit would set it true. Returns ErrProfileEditPendingApproval
  with the new approval ID; ProfileHandler maps to HTTP 202
  Accepted + {pending_approval_id}. Both arms close the
  flip-flop loophole because every transition through an
  approval-tier profile fires the gate.
* TestProfileEdit_RequiresApprovalLoopholeClosed pins all 3
  bypass attempts (flip-off / kept-on / flip-on) gated; nil-
  approval-service preserves pre-Phase-9 direct-apply for
  test fixtures.
* Approval service tests gain 4 profile_edit rows: pending row
  shape; same-actor self-approve rejected with
  ErrApproveBySameActor (load-bearing two-person integrity);
  approve fails-closed when apply callback unwired;
  apply callback invoked on approve.
* docs/reference/profiles.md (new) explains the gate +
  edit response shape (202) + same-actor invariant + bypass
  + audit hooks.

# Phase 10 — RBAC management GUI

* useAuthMe hook (web/src/hooks/useAuthMe.ts): TanStack Query
  fetches /api/v1/auth/me on app boot, caches for 60s, exposes
  hasPerm(p) + hasAnyPerm + isAdmin predicates. Every Phase-10
  page consumes this on mount + gates affordances against the
  cached effective_permissions slice. Server-side enforcement
  is the load-bearing gate; client-side hide/disable is UX.
* New routes:
   - /auth/roles — list (auth.role.list); create-role modal
     (auth.role.create) hidden when missing.
   - /auth/roles/:id — detail + permissions; edit
     (auth.role.edit), delete (auth.role.delete), add/remove
     permission affordances each gated.
   - /auth/keys — list of every actor with role grants; assign
     + revoke modals (auth.role.assign). actor-demo-anon
     flagged system-managed; mutation buttons hidden for it.
   - /auth/settings — stub showing /v1/auth/me identity +
     bootstrap-endpoint availability via /v1/auth/bootstrap.
* AuditPage extended with category filter ('All categories'
  + the 3 enum values from migration 000032). Selection flows
  to the API call params + the URL-driven query state.
* Layout: 3 new nav entries (Roles / API Keys / Auth Settings).
* api/client.ts: 12 new exported functions for the RBAC
  surface (authMe, list/get/create/update/delete role,
  list/add/remove role permissions, list keys, assign/revoke
  key role, bootstrap-availability probe).
* data-testid attributes on every interactive element so a
  future Playwright suite can assert behavior without brittle
  CSS selectors.
* Empty state, error state, and unsaved-changes warnings on
  every form per the prompt's implementation rules.

# Frontend tests

* RolesPage.test.tsx (6 tests): list render, empty state,
  error state, hide-create-button-without-perm,
  show-create-button-with-perm, submit-create-modal.
* KeysPage.test.tsx (3 tests): demo-anon flagged
  system-managed (no buttons), permission-gated affordance
  hide for auditor caller, assign-modal-POST contract.
* AuthSettingsPage.test.tsx (2 tests): identity surface,
  bootstrap-OPEN-status surface.
* AuditPage.test.tsx (+1): category-filter select renders
  with the 4 documented options.

15 frontend tests total in src/pages/auth/ + the audit
category-filter test; all pass via npx vitest run.

# Verifications

* go vet ./... clean.
* staticcheck across internal/auth + handler + router + cli +
  service + repository + cmd + domain: clean.
* gofmt -l clean repo-wide.
* go test -short -count=1 green across internal/service,
  internal/api/handler, internal/api/router, internal/auth,
  internal/auth/bootstrap, internal/service/auth,
  internal/domain/auth, cmd/server, cmd/cli, internal/cli.
* npx tsc --noEmit clean.
* npm run build green (vite build produces dist/index.html
  + 946KB JS bundle; chunk-size warning is pre-existing).
* npx vitest run src/pages/auth/ src/pages/AuditPage.test.tsx
  green (15 tests, 4 files).
2026-05-09 21:03:59 +00:00
shankar0123 71ebccb8ba docs: fix broken ../examples/ links across docs/ (closes #11)
Reporter (thesudoer0003) flagged that the example links in
docs/getting-started/examples.md resolve to /docs/examples/ which
does not exist. Same bug pattern shows up in four other docs files.

The example READMEs live at examples/<name>/<name>.md at the repo
root, not under docs/. The references in the docs/ tree used
relative paths like `../examples/acme-nginx/acme-nginx.md` which
resolve to docs/examples/... — one level short of escaping out to
the repo root. Fix is one extra `../` so the path resolves to
examples/... at repo root, where the files actually live.

Files touched:
  docs/getting-started/examples.md           5 links
  docs/getting-started/why-certctl.md        1 link
  docs/migration/cert-manager-coexistence.md 1 link
  docs/migration/from-acmesh.md              1 link
  docs/migration/from-certbot.md             1 link

Verified: every `../../examples/<name>/<name>.md` reference now
resolves to the on-disk file. Re-checked via:

  for f in $(grep -rl 'examples/' docs/); do
    for link in $(grep -oE '\.\./\.\./examples/[^)]*' "$f"); do
      [ -e "$(dirname "$f")/$link" ] || echo "STILL BROKEN: $f -> $link"
    done
  done

zero "STILL BROKEN" output.

Closes #11
2026-05-06 20:30:32 +00:00
shankar0123 1720e11109 docs: fix broken single-file demo invocation in README + qa-prerequisites + ENVIRONMENTS
The README's Quick Start, the qa-prerequisites contributor doc, and the
landing page (separate repo, separate commit) all shipped a copy-paste
command that produces:

  service "certctl-server" has neither an image nor a build context
  specified: invalid compose project

The bug landed silently with commit a3d8b9c (the U-3 master). Pre-U-3,
docker-compose.demo.yml was self-contained and could be invoked with a
single -f flag. U-3 deliberately reduced it to a 27-line overlay — its
only payload today is `CERTCTL_DEMO_SEED=true` on the certctl-server
service — because the demo seed now applies at boot via
postgres.RunDemoSeed, not via /docker-entrypoint-initdb.d/. The overlay
no longer carries an image: or build: of its own, so it MUST be passed
alongside the base file.

The README/qa-doc/landing-page never picked up the rename of the contract.
Every operator who copy-pasted the Quick Start since U-3 has hit the
"invalid compose project" error and bounced. The operator caught it
running the demo locally today.

This commit fixes the three certctl-repo sites:

  README.md (Quick Start)
    docker compose -f deploy/docker-compose.demo.yml up -d --build
    →
    docker compose -f deploy/docker-compose.yml -f deploy/docker-compose.demo.yml up -d --build

    Plus the "drop the -f flag for clean install" prose now spells out
    the correct fallback (`-f deploy/docker-compose.yml` alone).

  docs/contributor/qa-prerequisites.md (Step 1)
    Same single-file → two-file fix, plus an inline note explaining
    why the override-only file requires the base (so the next person
    who reads it understands the contract instead of re-discovering it).

  deploy/ENVIRONMENTS.md (Demo Overlay → What it adds)
    Replaced the stale "One line: mounts seed_demo.sql into PostgreSQL's
    init directory" claim — that hasn't been true since U-3 — with the
    accurate "One env var: CERTCTL_DEMO_SEED=true; server applies
    seed_demo.sql at boot via postgres.RunDemoSeed" description, plus
    the historical context for why the overlay can't stand alone.

The certctl.io landing page hits the same bug (line 759); fix shipping
in a separate commit in that repo.

Acceptance gate (manual):
  - copy/paste the new README Quick Start command end-to-end against
    a fresh clone — succeeds, dashboard at https://localhost:8443
    shows the seeded demo data within ~30s.
  - clean-install fallback (`docker compose -f deploy/docker-compose.yml
    up -d --build`) starts a working stack with no demo data.
2026-05-05 20:55:26 +00:00
shankar0123 0e06f6c4fc cli: promote --force on renew + require --reason on revoke (closes P3-1, P3-2)
Closes findings P3-1 and P3-2 from the 2026-05-05 CLI/API/MCP↔GUI parity
audit (cowork/cli-gui-parity-audit-2026-05-05/RESULTS.md). Both findings
flagged hidden defaults that the CLI was sending without exposing them
to operators: `force=false` baked into every renew payload, and a silent
fallback to `reason="unspecified"` whenever --reason was omitted.

P3-1 — promote --force on `certs renew` (full end-to-end plumbing)

The pre-2026-05-05 CLI sent `{"force": false}` in the renew body. The
API handler never decoded it — a textbook "lying field" per the
operator's CLAUDE.md "complete path, not the easy path" rule: the body
field stored a value, claimed to do something, and silently did nothing
because the wire never reached the consumer. Adding a --force flag that
also went unread would have created another lying field.

This commit takes the complete path:

  service.CertificateService.TriggerRenewal grew a `force bool` parameter
  (internal/service/certificate.go). When force=true, the
  RenewalInProgress block is overridden so operators can recover stuck
  in-flight renewals where a previous job hung without releasing the
  status flag. Archived and Expired remain terminal blockers regardless
  of force — those are semantic dead-ends that --force should not paper
  over (archived = decommissioned, expired = issue a new cert instead of
  renewing a dead one).

  handler.CertificateHandler.TriggerRenewal parses force from
  ?force=true (or ?force=1) query param, OR {"force": true} JSON body,
  whichever the client picks. Defaults to false. Passes through to the
  service.

  internal/cli/client.go::RenewCertificate(id, force bool) sends
  ?force=true on the URL when --force is set. The historical hardcoded
  `{"force": false}` body is gone — no more lying field.

  cmd/cli/main.go dispatches `certs renew <id> [--force]` (ID-first
  flag-second convention matches the existing `agents retire <id>
  [--force]`).

P3-2 — require --reason on `certs revoke` (Option A: strict refusal)

The pre-2026-05-05 CLI dropped to `--reason unspecified` whenever the
operator omitted the flag. Compliance reporting (RFC 5280 §5.3.1, PCI-
DSS §3.6, HIPAA §164.312) relies on the reason code being meaningful;
silent fallback defeats the audit trail because every revocation looks
identical.

  cmd/cli/main.go dispatch refuses to send when --reason is empty,
  prints the canonical RFC 5280 §5.3.1 reason-code menu, and exits
  non-zero.

  internal/cli/client.go exposes ValidRevokeReasons() returning the
  canonical camelCase list (unspecified, keyCompromise, caCompromise,
  affiliationChanged, superseded, cessationOfOperation, certificateHold,
  removeFromCRL, privilegeWithdrawn, aaCompromise) and
  NormalizeRevokeReason() that accepts both camelCase and snake_case
  inputs and normalises to the canonical wire form. Off-list reasons
  are rejected at dispatch with the menu re-printed.

Test pins:

  internal/cli/client_test.go::TestClient_RenewCertificate_ForceFlag —
  --force=true sends ?force=true with empty body; --force=false sends
  no query and no body.

  internal/cli/client_test.go::TestNormalizeRevokeReason +
  TestValidRevokeReasons — canonical-camelCase + snake_case + reject-
  off-enum behaviour.

  cmd/cli/dispatch_test.go::TestHandleCerts_Revoke_RequiresReason +
  TestHandleCerts_Revoke_RejectsUnknownReason +
  TestHandleCerts_Renew_ForceFlag — dispatch-layer pins for the same
  contracts.

  internal/api/handler/certificate_handler_test.go::TestTriggerRenewal_
  ForceQueryParam — query-param passthrough (no-flag, force=true,
  force=1, force=false) flows through to the service-layer parameter.

  internal/service/certificate_test.go::TestTriggerRenewal_
  ForceOverridesInProgress — force=false preserves the
  RenewalInProgress block; force=true clears it.

  Existing TestTriggerRenewal_Archived extended to assert force=true
  still blocks Archived (terminal-state guarantee).

Docs: docs/reference/cli.md updated with the --force example for renew
and the strict --reason semantics for revoke (including snake_case
input acceptance).

Acceptance gate (verified):
  - go build ./cmd/server/... ./cmd/agent/... ./cmd/cli/...
    ./cmd/mcp-server/... clean.
  - go vet ./... clean.
  - go test -short -count=1 ./... pass repo-wide.
  - bash scripts/ci-guards/openapi-handler-parity.sh clean
    (router 178, OpenAPI 144, exceptions 36 — unchanged; we add
    parameter parsing, not routes).
  - gofmt -l clean.
2026-05-05 19:49:34 +00:00
shankar0123 ff75361553 mcp(coverage): add 34 tools across 7 domains to close 2026-05-05 parity audit P1 findings
Closes findings P1-1..P1-35 from the 2026-05-05 CLI/API/MCP↔GUI parity
audit (cowork/cli-gui-parity-audit-2026-05-05/RESULTS.md). Before this
bundle, 35 operator-facing API endpoints had GUI surfaces but no MCP
counterpart — operators using AI assistants for cert lifecycle work in
regulated environments had to drop to curl for approve/reject, health-check
acknowledgement, renewal-policy CRUD, network-scan triggering, discovery
triage, intermediate-CA management, and job verification.

Tool count: 87→121 in tools.go (+34), 6 unchanged in tools_est.go.
Re-derive via grep -cE 'gomcp\\.AddTool\\(' internal/mcp/tools.go
internal/mcp/tools_est.go.

The 7 phases (matching the bundle prompt at
cowork/mcp-coverage-expansion-prompt.md):

  Phase A — Approvals (P1-28..P1-31, 4 tools)
    list_approvals, get_approval, approve_request, reject_request.
    Two-person-integrity contract (ErrApproveBySameActor → HTTP 403)
    is preserved automatically: the decided_by actor is derived
    server-side from middleware.UserKey, NOT from request body, so
    the MCP server's authenticated API-key identity becomes the
    audit-trail actor. The MCP input schema deliberately omits any
    actor_id field to prevent client-side spoofing.

  Phase B — Health Checks (P1-20..P1-27, 8 tools)
    list, summary, get, create, update, delete, history, acknowledge.
    Mirrors the existing target-resource shape; acknowledge takes
    optional 'actor' string captured in the audit row (handler defaults
    to 'unknown' if absent).

  Phase C — Renewal Policies (P1-1..P1-5, 5 tools)
    Standard CRUD against /api/v1/renewal-policies. Distinct from the
    legacy 'policy' tools that point at the same path — these expose
    the renewal-policy domain explicitly with full alert_channels +
    alert_severity_map field shape.

  Phase D — Network Scan Targets (P1-14..P1-19, 6 tools)
    CRUD + trigger_scan. trigger_network_scan returns the discovery-
    scan body so the AI can chain into list_discovered_certificates
    filtered by agent_id.

  Phase E — Discovery read-side (P1-10..P1-13, 4 tools)
    list_discovered_certificates, get_discovered_certificate,
    list_discovery_scans, discovery_summary. Complements the
    pre-existing claim/dismiss tools (registered alongside Health
    historically per the I-2 closure).

  Phase F — Intermediate CAs (P1-6..P1-9, 4 tools)
    list, create (root + child via discriminator on body shape), get,
    retire. The handler is admin-gated via middleware.IsAdmin; the
    least-privilege boundary is enforced at the API layer (HTTP 403
    for non-admin Bearer callers) — not by transport carve-out.

  Phase G — Verification + deployments (P1-32, P1-34, P1-35, 3 tools)
    list_certificate_deployments, verify_job, get_job_verification.
    P1-33 (POST /api/v1/agents/{id}/discoveries) is intentionally
    excluded — machine-to-machine push channel for agents reporting
    filesystem-scan results, not an operator-driven flow. Documented
    inline in the RegisterTools dispatch.

Implementation:
  - 14 new input types in internal/mcp/types.go with jsonschema struct
    tags driving LLM tool discovery.
  - 7 register* functions in internal/mcp/tools.go each handling one
    phase, wired into RegisterTools dispatch in declaration order.
  - 34 new entries in tools_per_tool_test.go::allHappyPathCases —
    the existing in-process MCP harness (TestMCP_AllTools_HappyPath +
    TestMCP_AllTools_ErrorPath + TestMCP_RegisterTools_DispatchableToolCount)
    auto-extends coverage to cover every new tool: happy-path round-
    trip with fence-shape assertion, 5xx error-path with MCP_ERROR fence
    propagation, and 'every registered tool is dispatchable' guard.
  - docs/reference/mcp.md 'Available Tools' table expanded from 16 to
    22 resource domains with current per-domain tool counts.

Acceptance gate (verified):
  - go build ./cmd/server/... ./cmd/agent/... ./cmd/cli/... ./cmd/mcp-server/...
    clean across all four production binaries.
  - go vet ./... clean.
  - go test -short -count=1 ./internal/mcp/... pass (TestMCP_AllTools_*
    expanded to 127 tool round-trips).
  - go test -short -count=1 ./... pass repo-wide.
  - bash scripts/ci-guards/openapi-handler-parity.sh clean (router 178,
    OpenAPI 144, exceptions 36 — unchanged; we add MCP wrappers, not
    routes).
  - gofmt -l clean across the four touched files.
2026-05-05 19:29:57 +00:00