Commit Graph

141 Commits

Author SHA1 Message Date
shankar0123 3e91c7a1f0 chore(security): bump Go toolchain 1.25.9 -> 1.25.10 + golang.org/x/net 0.49 -> 0.53
CI run #484's Go Build & Test job failed govulncheck (M-024 hard
gate). Six standard-library CVEs land in go1.25.9 + one
golang.org/x/net CVE in v0.49.0; all are fixed in go1.25.10 + x/net
v0.53.0 respectively. The advisories that fired were:

  GO-2026-4986  Quadratic string concat in net/mail.consumeComment
                — called via internal/api/handler/validation.go's
                ValidateCommonName -> mail.ParseAddress
  GO-2026-4977  Quadratic string concat in net/mail.consumePhrase
                — same call site
  GO-2026-4982  Bypass of meta-content URL escaping in html/template
                — called via internal/service/digest.go's
                RenderDigestHTML -> Template.Execute
  GO-2026-4980  Escaper bypass in html/template
                — same call site
  GO-2026-4971  Panic in net.Dial / LookupPort on Windows NUL bytes
                — many call sites (email notifier, SSH connector,
                ACME validators, validation.ValidateSafeURL, ...)
  GO-2026-4918  Infinite loop in net/http2 transport on bad
                SETTINGS_MAX_FRAME_SIZE
                — called via internal/connector/target/f5.go's
                F5Client.Authenticate -> http.Client.Do

Bumps applied:

* `go.mod`: `go 1.25.9` -> `go 1.25.10`; `golang.org/x/net v0.49.0`
  -> `v0.53.0` (kept indirect — the upgrade is force-pulled by the
  module-version directive; transitive deps will pick the higher).
* `.github/workflows/{ci,codeql,release}.yml`: setup-go pin and the
  release.yml `GO_VERSION` env var bumped to 1.25.10. The
  security-deep-scan.yml workflow uses the major-minor `1.25` pin
  which auto-resolves to the latest 1.25.x and is unaffected.
* `Dockerfile` + `Dockerfile.agent`: `golang:1.25-alpine@sha256:5caa...`
  re-pinned to `golang:1.25.10-alpine@sha256:8d22e29d960bc50cd0...`
  (digest looked up against `registry-1.docker.io/v2/library/golang/
  manifests/1.25.10-alpine`; verified by the digest-validity ci-guard).
  The explicit `1.25.10-alpine` tag form replaces the moving
  `1.25-alpine` pin so the image-spec is reproducible end-to-end
  even without the digest reference.
* `deploy/test/f5-mock-icontrol/Dockerfile`: `golang:1.25.9-bookworm
  @sha256:1a14...` re-pinned to `golang:1.25.10-bookworm@sha256:
  e3a54b77385b4f8a31c1...` (looked up the same way).
* `deploy/test/f5-mock-icontrol/go.mod`: `go 1.25.9` -> `go 1.25.10`.
* `internal/api/handler/version.go` + `api/openapi.yaml`: the
  `runtime.Version()`-shape comment + OpenAPI `example: go1.25.9`
  bumped to keep doc/example freshness.
* `docs/contributor/ci-pipeline.md` + `docs/reference/connectors/
  iis.md`: doc-only `Go 1.25.9` -> `Go 1.25.10` references.

Verification done in-tree:

* All `scripts/ci-guards/*.sh` pass locally including
  `digest-validity.sh` (the new digests resolve cleanly against
  Docker Hub).
* `S-1-hardcoded-source-counts.sh` clean (the false-positive on
  "Bundle 1 migrations" was fixed in the prior commit).

Operator step required post-push (sandbox has no Go toolchain):

  cd certctl && go mod tidy

This regenerates go.sum's `golang.org/x/net v0.49.0` h1: lines into
v0.53.0 ones. CI's `go mod tidy && git diff --exit-code go.mod
go.sum` step will catch the drift if missed; in that case run the
command, commit, and push the go.sum-only delta.
2026-05-09 21:35:46 -04:00
shankar0123 cbb47aaf5d auth-bundle-1 Phase 11 + 12: RBAC MCP tools + negative-test coverage gate
# Phase 11 — RBAC MCP tools

12 new tools in internal/mcp/tools_auth.go mirroring the Phase-4
+ Phase-7 HTTP surface so operators driving certctl from Claude
/ VS Code / any MCP client get the same management capability
the GUI + CLI already expose:

  certctl_auth_me                          GET    /v1/auth/me
  certctl_auth_list_roles                  GET    /v1/auth/roles
  certctl_auth_get_role                    GET    /v1/auth/roles/{id}
  certctl_auth_create_role                 POST   /v1/auth/roles
  certctl_auth_update_role                 PUT    /v1/auth/roles/{id}
  certctl_auth_delete_role                 DELETE /v1/auth/roles/{id}
  certctl_auth_list_permissions            GET    /v1/auth/permissions
  certctl_auth_add_permission_to_role      POST   /v1/auth/roles/{id}/permissions
  certctl_auth_remove_permission_from_role DELETE /v1/auth/roles/{id}/permissions/{perm}
  certctl_auth_list_keys                   GET    /v1/auth/keys
  certctl_auth_assign_role_to_key          POST   /v1/auth/keys/{id}/roles
  certctl_auth_revoke_role_from_key        DELETE /v1/auth/keys/{id}/roles/{role_id}

Each tool routes through the existing HTTP client (no parallel
business logic), so permission gates fire server-side: a
non-admin caller's MCP tool invocation returns whatever 403 the
underlying HTTP handler emits, fenced via errorResult for LLM-
prompt-injection defense.

Input types in internal/mcp/types.go (AuthRoleIDInput,
AuthCreateRoleInput, AuthUpdateRoleInput,
AuthRolePermissionGrantInput, AuthRolePermissionRevokeInput,
AuthAssignKeyRoleInput, AuthRevokeKeyRoleInput) carry
jsonschema descriptions so the MCP consumer's tool catalogue
shows operator-friendly hints.

internal/mcp/tools_auth_test.go ships 14 tests:
  - TestAuthMCP_AllToolsRegister (registration must not panic)
  - TestAuthMCP_PathsAndMethods (table-driven, 12 rows pinning
    each tool's HTTP method + URL)
  - TestAuthMCP_ForbiddenSurfacesFencedError (12 tools × 403
    mock → error surface)

internal/mcp/tools_per_tool_test.go's allHappyPathCases extended
with the 12 new rows so the in-memory dispatch coverage gate
(TestMCP_RegisterTools_DispatchableToolCount) stays green at the
new total of 139 registered tools.

Re-derived total via 'grep -cE "gomcp\.AddTool\(" internal/mcp/tools*.go':
133 (121 in tools.go + 12 in tools_auth.go).

# Phase 12 — negative-test coverage gate

Audit of the prompt's 12 negative-test paths against existing
coverage:

  1.  Missing actor → 401          ✓ TestRequirePermission_NoActorReturns401, TestRBACGate_NoActorReturns401
  2.  No roles → 403               ✓ TestRequirePermission_DeniedActorReturns403, TestRBACGate_AuditorRole_403sOnAdminRoutes
  3.  Role lacks specific perm → 403 ✓ same suite
  4.  Wrong scope → 403            ✓ TestAuthorizer_SpecificScopeMatchesExactID (wrongID arm)
  5.  Self-grant w/o auth.role.assign → 403 ✓ TestActorRoleService_GrantRequiresAuthRoleAssign
  6.  Bootstrap token wrong → 401  ✓ TestEnvTokenStrategy_WrongTokenReturnsInvalidToken, TestBootstrapHandler_Mint_WrongToken_401
  7.  Bootstrap used twice → 410   ✓ TestEnvTokenStrategy_OneShotConsumption, TestBootstrapHandler_Mint_TwiceReturns410
  8.  Bootstrap when admin exists → 410 ✓ TestEnvTokenStrategy_AdminExistsClosesPath, TestBootstrapHandler_Mint_AdminExists410
  9.  Role delete with assignees → 409 NEW: TestRoleService_DeleteWithActorsAssignedReturns409
  10. Profile-edit loophole → gated ✓ TestProfileEdit_RequiresApprovalLoopholeClosed
  11. Permission not in catalog → 400 ✓ TestRoleService_AddPermissionRejectsNonCanonical
  12. Scope ID for nonexistent resource → 404 (validation deferred — no FK constraint between role_permissions.scope_id and the resource tables; documented for a future bundle)

Filled the gap at #9 with TestRoleService_DeleteWithActorsAssignedReturns409
which pins the repository sentinel pass-through (postgres FK
ON DELETE RESTRICT → repository.ErrAuthRoleInUse → service
returns the sentinel verbatim → handler maps to HTTP 409).

# Coverage gates

.github/coverage-thresholds.yml gains 2 entries:
  - internal/auth: floor 85
  - internal/service/auth: floor 85

.github/workflows/ci.yml's coverage test command extended with
./internal/auth/... and ./internal/api/router/... so the
threshold check has data to evaluate.

# Protocol-endpoint not-gated test (Category F)

internal/api/router/phase12_protocol_allowlist_test.go (new)
adds 3 router-level invariant tests:

  - TestPhase12_ProtocolEndpointsNotGated: AST-walks router.go,
    asserts no rbacGate(...) call references a path under any
    protocol-endpoint prefix (/acme, /scep, /.well-known/est,
    /.well-known/pki/ocsp, /.well-known/pki/crl).
  - TestPhase12_IsProtocolEndpoint_CoversCanonicalPrefixes:
    pins auth.IsProtocolEndpoint against the canonical prefix
    set; if a future protocol lands without lockstep allowlist
    update, this fails.
  - TestPhase12_RBACGateRoutesAreUnderAPIv1: belt-and-braces —
    every rbacGate-wrapped route MUST start with /api/v1/.
    Catches accidental cross-prefix wraps.

Complements the existing TestRequirePermission_ProtocolEndpointBypassesGate
(middleware-level) + TestRouter_AuthExemptAllowlist_PinsActualRegistrations
(allowlist drift) so the Category F invariant is pinned at all
three layers (middleware + router + dispatch).

# Verifications

* gofmt clean repo-wide.
* go vet ./... clean.
* staticcheck across internal/auth + handler + router + cli +
  service + repository + cmd + domain + mcp: clean.
* go test -short -count=1 green across internal/auth (incl.
  bootstrap), internal/api/handler, internal/api/router,
  internal/cli, internal/service (incl. auth),
  internal/domain/auth, internal/mcp, cmd/server, cmd/cli.
2026-05-09 23:46:01 +00:00
shankar0123 69a508dfcf auth-bundle-1 Phase 9 + 10: approval-bypass closure + RBAC GUI
# Phase 9 — approval-bypass closure (Decision 9, option a)

* Migration 000033_approval_kinds.up.sql: ALTER TABLE
  issuance_approval_requests ADD COLUMN approval_kind +
  payload JSONB; relax certificate_id + job_id to nullable;
  CHECK (approval_kind IN ('cert_issuance','profile_edit'))
  + CHECK (per-kind nullability invariant) + index on
  approval_kind. Idempotent throughout via DO blocks.
* domain.ApprovalKind enum (cert_issuance / profile_edit) +
  IsValidApprovalKind. ApprovalRequest gains Kind +
  Payload []byte for the pending profile diff.
* postgres.ApprovalRepository.Create + scanApprovalRow extended
  to round-trip the new columns; certificate_id + job_id
  switched to sql.NullString so profile_edit rows persist
  cleanly. Default Kind=cert_issuance preserves back-compat
  for every Phase-7-2026-05-03 caller.
* ApprovalService.RequestProfileEditApproval: new entry point
  that creates a pending profile-edit row carrying the
  serialized profile diff. Bypass mode (CERTCTL_APPROVAL_BYPASS)
  short-circuits the same way it does for cert_issuance.
* ApprovalService.SetProfileEditApply hook: cmd/server/main.go
  registers a closure that deserializes req.Payload + persists
  via profileRepo.Update + emits a profile.edit_applied audit
  row with category=auth. The hook avoids the Approval ↔
  Profile import cycle.
* ProfileService.UpdateProfile: gates when (a) the live
  profile carries RequiresApproval=true, OR (b) the proposed
  edit would set it true. Returns ErrProfileEditPendingApproval
  with the new approval ID; ProfileHandler maps to HTTP 202
  Accepted + {pending_approval_id}. Both arms close the
  flip-flop loophole because every transition through an
  approval-tier profile fires the gate.
* TestProfileEdit_RequiresApprovalLoopholeClosed pins all 3
  bypass attempts (flip-off / kept-on / flip-on) gated; nil-
  approval-service preserves pre-Phase-9 direct-apply for
  test fixtures.
* Approval service tests gain 4 profile_edit rows: pending row
  shape; same-actor self-approve rejected with
  ErrApproveBySameActor (load-bearing two-person integrity);
  approve fails-closed when apply callback unwired;
  apply callback invoked on approve.
* docs/reference/profiles.md (new) explains the gate +
  edit response shape (202) + same-actor invariant + bypass
  + audit hooks.

# Phase 10 — RBAC management GUI

* useAuthMe hook (web/src/hooks/useAuthMe.ts): TanStack Query
  fetches /api/v1/auth/me on app boot, caches for 60s, exposes
  hasPerm(p) + hasAnyPerm + isAdmin predicates. Every Phase-10
  page consumes this on mount + gates affordances against the
  cached effective_permissions slice. Server-side enforcement
  is the load-bearing gate; client-side hide/disable is UX.
* New routes:
   - /auth/roles — list (auth.role.list); create-role modal
     (auth.role.create) hidden when missing.
   - /auth/roles/:id — detail + permissions; edit
     (auth.role.edit), delete (auth.role.delete), add/remove
     permission affordances each gated.
   - /auth/keys — list of every actor with role grants; assign
     + revoke modals (auth.role.assign). actor-demo-anon
     flagged system-managed; mutation buttons hidden for it.
   - /auth/settings — stub showing /v1/auth/me identity +
     bootstrap-endpoint availability via /v1/auth/bootstrap.
* AuditPage extended with category filter ('All categories'
  + the 3 enum values from migration 000032). Selection flows
  to the API call params + the URL-driven query state.
* Layout: 3 new nav entries (Roles / API Keys / Auth Settings).
* api/client.ts: 12 new exported functions for the RBAC
  surface (authMe, list/get/create/update/delete role,
  list/add/remove role permissions, list keys, assign/revoke
  key role, bootstrap-availability probe).
* data-testid attributes on every interactive element so a
  future Playwright suite can assert behavior without brittle
  CSS selectors.
* Empty state, error state, and unsaved-changes warnings on
  every form per the prompt's implementation rules.

# Frontend tests

* RolesPage.test.tsx (6 tests): list render, empty state,
  error state, hide-create-button-without-perm,
  show-create-button-with-perm, submit-create-modal.
* KeysPage.test.tsx (3 tests): demo-anon flagged
  system-managed (no buttons), permission-gated affordance
  hide for auditor caller, assign-modal-POST contract.
* AuthSettingsPage.test.tsx (2 tests): identity surface,
  bootstrap-OPEN-status surface.
* AuditPage.test.tsx (+1): category-filter select renders
  with the 4 documented options.

15 frontend tests total in src/pages/auth/ + the audit
category-filter test; all pass via npx vitest run.

# Verifications

* go vet ./... clean.
* staticcheck across internal/auth + handler + router + cli +
  service + repository + cmd + domain: clean.
* gofmt -l clean repo-wide.
* go test -short -count=1 green across internal/service,
  internal/api/handler, internal/api/router, internal/auth,
  internal/auth/bootstrap, internal/service/auth,
  internal/domain/auth, cmd/server, cmd/cli, internal/cli.
* npx tsc --noEmit clean.
* npm run build green (vite build produces dist/index.html
  + 946KB JS bundle; chunk-size warning is pre-existing).
* npx vitest run src/pages/auth/ src/pages/AuditPage.test.tsx
  green (15 tests, 4 files).
2026-05-09 21:03:59 +00:00
shankar0123 af4fa12724 auth-bundle-1 Phase 8 follow-up: classify issuer/target audit rows + auditor end-to-end tests + gofmt drift
Self-audit caught five real gaps in 3ef45e2; this commit closes them.

# Phase 8 — issuer/target audit rows now classified as 'config'

The Phase 8 prompt explicitly required existing config-mutation
calls (issuer config, target config, etc.) to write
event_category=config. The 3ef45e2 commit only migrated the auth
service callers; the 6 issuer/target call-sites
(internal/service/issuer.go: create/update/delete_issuer +
internal/service/target.go: create/update/delete_target) still
defaulted to cert_lifecycle. They now pass through
RecordEventWithCategory(..., domain.EventCategoryConfig, ...) so
auditors filtering /v1/audit?category=config see the slice the
migration's docstring promised.

# Auditor exit-criterion test

Phase 8's exit criteria pin 'a user with the auditor role can list /
export audit events but gets 403 on every other endpoint.' Bundle 1
unit invariants (auditor permission set, rbacGate behaviour) were
in place but no end-to-end test walked the full set of admin perms
with an auditor actor. internal/api/router/rbac_gate_integration_test.go
gains TestRBACGate_AuditorRole_403sOnAdminRoutes (table-driven across
all 5 admin perms — cert.bulk_revoke / crl.admin / scep.admin /
est.admin / ca.hierarchy.manage) plus TestRBACGate_AuditorRole_PassesAuditReadGate
(positive case for audit.read).

# gofmt drift

3ef45e2 left two cosmetic struct-field-alignment diffs in
internal/cli/auth.go and internal/api/handler/audit_handler_test.go
that gofmt -l flagged. CI's gofmt step would have failed; gofmt -w
applied; gofmt -l now clean across the repo.

# CHANGELOG path-prefix

CHANGELOG.md v2.1.0 used '/v1/auth/bootstrap' shorthand in the
operator-facing flow examples. The actual route is
'/api/v1/auth/bootstrap'; an operator copy-pasting the curl would
404. All five hits replaced.

Verifications: gofmt clean, go vet ./internal/service/
./internal/api/router/ clean, go test -short -count=1 green across
internal/service + internal/api/router, including the 6 new
auditor sub-tests (PASS).
2026-05-09 20:23:41 +00:00
shankar0123 3ef45e2ad4 auth-bundle-1 Phase 6-7-8: bootstrap path + scope-down CLI + auditor-role split
# Phase 6 — day-0 admin bootstrap

* internal/auth/bootstrap/ (new package): Strategy interface +
  EnvTokenStrategy with constant-time compare, one-shot consumption
  via sync.Mutex, optional admin-existence probe. Bundle 2's OIDC-
  first-admin will plug in alongside as an alternate Strategy.
* BootstrapService.ValidateAndMint: validates the operator's
  CERTCTL_BOOTSTRAP_TOKEN, mints a 32-byte (64-hex-char) random API
  key value, persists the SHA-256 hash to api_keys, grants r-admin
  via actor_roles, AddHashed's the runtime keystore so the just-
  minted key authenticates the next request without restart, and
  records bootstrap.consume to the audit trail with category=auth.
* internal/auth/keystore.go (new): KeyStore interface +
  StaticKeyStore (immutable env-var-only path) + MutableKeyStore
  (env-var keys + DB-loaded api_keys + runtime AddHashed). The auth
  middleware now consumes a KeyStore so the bootstrap path can
  extend the lookup table at runtime.
* migrations/000031_api_keys.up/down.sql: api_keys table with
  (id, name UNIQUE, key_hash UNIQUE, tenant_id, admin, created_by,
  created_at, expires_at, last_used_at). Idempotent.
* /v1/auth/bootstrap GET (probe) + POST (mint) — auth-exempt. Both
  routes documented in api/openapi.yaml + AuthExemptRouterRoutes
  allowlist updated. The token never leaves internal/auth/bootstrap;
  the minted plaintext key flows only into the HTTP response body.
* Startup warning emitted when CERTCTL_BOOTSTRAP_TOKEN is set AND
  admin actors already exist (config drift signal).
* Tests: 4 strategy invariants (empty token born disabled, wrong
  token=ErrInvalidToken without consumption, one-shot consumption,
  admin-exists closes path), 5 service tests (happy path + actor-
  name validation + propagation of strategy errors + nil-deps
  guard + 32-byte entropy budget), 8 HTTP-handler tests (status
  201/410/401/400 mapping + token-leak hygiene scan of slog +
  audit details + Location header). Token-leak test redirects
  slog.Default to a buffer for the test scope.

# Phase 7 — API-key migration + scope-down CLI

* GET /v1/auth/keys handler + service method ListKeys backed by
  ActorRoleRepository.ListDistinctActors. Returns one row per
  (actor_id, actor_type) pair with the slice of role IDs they hold.
  Permission: auth.role.list.
* internal/cli/auth_scope_down.go: AuthListKeys, AuthScopeDown
  (interactive), AuthScopeDownNonInteractive (JSON config),
  AuthScopeDownSuggest (--suggest with optional --apply). The
  synthetic actor-demo-anon is filtered out of every interactive /
  bulk path; non-interactive flow logs and skips it explicitly.
* SuggestRoleFromAuditEvents (pure function): walks 30 days of
  audit events per actor and returns the narrowest matching role
  (admin / mcp / viewer / agent / operator) plus a one-line reason.
  Classification: any admin-shaped action wins; otherwise all-MCP
  → mcp; all-read-only → viewer; all-agent-shaped → agent;
  otherwise operator. Test table pins all six classifications.
* CLI subcommand tree extended: 'auth keys list' + 'auth keys
  scope-down [--non-interactive <cfg>] [--suggest [--apply]]'.
* CHANGELOG.md leads v2.1.0 with the SECURITY: AUDIT YOUR API KEYS
  call-out + four flow examples.

# Phase 8 — auditor role + event_category column

* migrations/000032_audit_category.up/down.sql: ALTER TABLE
  audit_events ADD COLUMN event_category TEXT NOT NULL DEFAULT
  'cert_lifecycle' + CHECK constraint (cert_lifecycle/auth/config)
  + (event_category) and (event_category, timestamp DESC) indexes
  for the auditor-filter query path. WORM trigger from migration
  000018 continues to enforce append-only at the DB layer (DDL is
  not blocked).
* domain.AuditEvent gains EventCategory string (omitempty);
  domain.EventCategoryCertLifecycle / Auth / Config constants.
* AuditService.RecordEventWithCategory sibling of RecordEvent;
  legacy callers stay on RecordEvent (defaults to cert_lifecycle).
  Auth callers (RoleService, ActorRoleService, BootstrapService)
  switched to RecordEventWithCategory(..., 'auth', ...).
* GET /v1/audit?category=<cat>: handler accepts the optional query
  param, validates against the enum (400 on invalid value),
  dispatches through ListAuditEventsByCategory. OpenAPI updated
  with the new query param + AuditEvent.event_category schema.
* Postgres AuditRepository.Create now writes event_category;
  AuditRepository.List filters on it; AuditFilter.EventCategory
  gates the WHERE clause.
* Tests: 5 audit-category-filter HTTP tests (dispatch routing,
  back-compat fallback, 400 for invalid values, all 3 enum values
  accepted, page+category combine, JSON output surfaces the
  field). 3 auditor-role invariants (auditor holds exactly
  audit.read+audit.export, no mutating perms, disjoint from
  viewer except audit.read).

# Cross-phase wiring

* HandlerRegistry.Bootstrap field added; cmd/server/main.go wires
  the bootstrap service ahead of RegisterHandlers (extracted
  assembleNamedAPIKeys helper into auth_backfill.go, moved the
  keystore + bootstrap construction up alongside the auth repos).
* AuthCheckResolver / AuthActorRoleService extended with ListKeys
  to satisfy the Phase 7 surface; existing fakes updated.
* fakeAudit + mockAuditService stubs in tests gain
  RecordEventWithCategory + ListAuditEventsByCategory; existing
  tests untouched.

# Verifications

* gofmt -l: clean across every modified file.
* go vet ./...: clean.
* staticcheck across internal/auth + handler + router + cli +
  service + repository + cmd + domain: clean.
* go test -short -count=1: green across every Bundle-1-touched
  package — internal/auth (incl. bootstrap), internal/api/handler,
  internal/api/router, internal/cli, internal/service/auth,
  internal/service, internal/domain/auth, internal/repository/postgres,
  cmd/server, cmd/cli, plus internal/scheduler, internal/api/middleware,
  cmd/agent, internal/mcp.
2026-05-09 20:15:43 +00:00
shankar0123 60a589ab96 auth-bundle-1 Phase 0-5 closure: demo-mode wire, named-key backfill, AuthCheck enrichment, OpenAPI schema, intermediate-ca comment refresh
Closes the 5 gaps the post-Phase-5 audit flagged on dev/auth-bundle-1.

C1: cmd/server/main.go now selects auth.NewDemoModeAuth() when
CERTCTL_AUTH_TYPE=none and falls back to auth.NewAuthWithNamedKeys
otherwise. Pre-closure, the no-op pass-through that
NewAuthWithNamedKeys returns for empty keys would have left
ActorIDKey / ActorTypeKey / TenantIDKey unpopulated and 401'd
every Phase-3.5 rbacGate-wrapped admin route + every Phase-4
RBAC handler in demo deployments. NewDemoModeAuth injects the
synthetic 'actor-demo-anon' actor seeded by migration 000029,
which holds r-admin at global scope.

C2: backfillNamedKeyActorRoles startup hook (cmd/server/auth_backfill.go)
iterates CERTCTL_API_KEYS_NAMED entries (and legacy
CERTCTL_AUTH_SECRET synthesized fallbacks) and grants r-admin
or r-viewer to each via authActorRoleRepo.Grant before the
HTTP server starts accepting requests. Idempotent via
ON CONFLICT DO NOTHING in the repo. Failures log a warning but
are non-fatal — the server still starts and the operator can
fix grants via /v1/auth/keys. Helper extracted from main.go so
the role-mapping invariant is pinned by 4 focused unit tests
(admin->r-admin, non-admin->r-viewer, empty no-op,
grant-error non-fatal, nil-logger safe).

M1: HealthHandler.AuthCheck now returns actor_id, actor_type,
tenant_id, roles, effective_permissions, and admin_via_role
when the optional AuthCheckResolver is wired (production path:
authCheckResolverAdapter wraps the postgres ActorRoleRepository
in main.go). Nil resolver preserves the legacy {status, user,
admin} contract for back-compat with pre-Bundle-1 GUIs and
test fixtures. Adds 2 regression tests + 1 fake resolver shim.

M2: refreshes the stale 'Admin gate: every method calls
auth.IsAdmin first' comment on IntermediateCAHandler — the gate
moved to router.go::rbacGate via auth.RequirePermission
middleware in Phase 3.5; the new comment block points readers
there.

M4: 11 RBAC routes (auth/me, auth/permissions, 5 role lifecycle,
2 role-permission grant/revoke, 2 actor-role grant/revoke) added
to api/openapi.yaml under the [Auth] tag with operationIds and
shared AuthRole / AuthRolePermission schemas. AuthCheck path
extended with the Bundle-1 enrichment fields. The 11 entries
removed from openapi_parity_test.go::SpecParityExceptions.

Tests: go vet + staticcheck + go test -short -count=1 green
across cmd/server/, internal/auth/, internal/api/router/, and
internal/api/handler/. New tests: 4 backfill unit tests,
2 AuthCheck M1 enrichment tests, 1 demo-mode + rbacGate chain
integration test (TestRBACGate_DemoModeChainReachesHandler).

Branch SECURITY.md (cowork/auth-bundle-1-SECURITY.md, not part
of this commit) captures the full posture of dev/auth-bundle-1
as of this closure for the operator's pre-merge review.
2026-05-09 19:33:07 +00:00
shankar0123 7ff2e2de08 auth-bundle-1 Phase 3.5: handler IsAdmin -> router-wrapped RequirePermission
Phase 3.5 atomic conversion. The five legacy admin-gated handlers (bulk_revocation, admin_crl_cache, admin_scep_intune, admin_est, intermediate_ca) had their in-body auth.IsAdmin checks removed; the gate moved to router.go via auth.RequirePermission middleware wrapping each route. Non-admin operators with the right scoped permission can now reach these endpoints; legacy in-body admin checks no longer block them.

Migration 000030_rbac_admin_perms.up.sql ships five admin-only fine-grained permissions: cert.bulk_revoke, crl.admin, scep.admin, est.admin, ca.hierarchy.manage. All five are seeded into r-admin only; operator/viewer/agent/mcp/cli/auditor do not receive them by default. Operators can grant any of these to a custom role via the Phase 4 RBAC API. Idempotent + transaction-wrapped.

internal/domain/auth/validate.go::CanonicalPermissions extended with the five new entries so RoleService.AddPermission accepts them.

internal/api/router/router.go: HandlerRegistry gains a Checker field (auth.PermissionChecker). New rbacGate(checker, perm, handler) helper wraps a handler with auth.RequirePermission middleware; nil-checker fall-through preserves test/demo deployments without the RBAC stack. 12 admin routes wrapped: cert.bulk_revoke (POST /api/v1/certificates/bulk-revoke + POST /api/v1/est/certificates/bulk-revoke), crl.admin (GET /api/v1/admin/crl/cache), scep.admin (GET /api/v1/admin/scep/profiles + GET /api/v1/admin/scep/intune/stats + POST /api/v1/admin/scep/intune/reload-trust), est.admin (GET /api/v1/admin/est/profiles + POST /api/v1/admin/est/reload-trust), ca.hierarchy.manage (POST /api/v1/issuers/{id}/intermediates + GET /api/v1/issuers/{id}/intermediates + POST /api/v1/intermediates/{id}/retire + GET /api/v1/intermediates/{id}).

cmd/server/main.go: HandlerRegistry.Checker wired with the same authPermissionCheckerAdapter shim Phase 4 introduced for AuthHandler. Same adapter; one source of truth.

Handler bodies: removed eight in-body auth.IsAdmin checks across the 5 files. bulk_revocation.go's BulkRevoke + BulkRevokeEST, admin_crl_cache.go::ListCache, admin_scep_intune.go's three methods, admin_est.go's two methods, intermediate_ca.go's four methods. Replaced each with a comment naming the new gate location. Unused 'github.com/certctl-io/certctl/internal/auth' imports removed.

Test triplet rewrite: deleted obsolete _NonAdmin_Returns403 and _AdminExplicitFalse_Returns403 tests across 6 test files (5 handler tests + bulk_revocation_est_test.go) — they tested the now-removed in-body gate. _AdminPermitted_ForwardsActor tests stay intact: they pin the actor-passthrough invariant which is still relevant. Added internal/api/router/rbac_gate_integration_test.go with four router-level integration tests pinning the new gate: deny → 403 + handler not reached, permit → 200 + handler reached, nil-checker → fall-through, no-actor → 401.

M-008 admin-gate registry: AdminGatedHandlers map now empty (Phase 3.5 invariant: zero in-handler auth.IsAdmin call sites; only health.go's informational caller remains). m008_admin_gate_test.go retains the scan to enforce the invariant going forward; new admin-gated routes must wrap at router.go::rbacGate, not gate in-handler. Updated error message to direct future contributors to the new pattern.

Verifications: gofmt clean across all touched files; go vet ./... clean; go test -short across internal/auth, internal/service/auth, internal/api/handler, internal/api/router, cmd/server all green.

Branch: dev/auth-bundle-1. Commit chain: 99a012e (Phase 0 extract) -> 19497ee (Phase 1 schema + repo) -> bd54d5f (Phase 2 service) -> d473398 (Phase 3 primitive) -> b169f25 (Phase 4 + 5) -> THIS (Phase 3.5 conversion). Phase 6+ (bootstrap, scope-down, auditor, approval-bypass closure, GUI, docs) on subsequent sessions.
2026-05-09 17:00:30 +00:00
shankar0123 b169f258de auth-bundle-1 Phase 4 + 5: RBAC HTTP API + CLI surface
Phase 4 (HTTP API):

* internal/api/handler/auth.go: AuthHandler with 12 endpoints under /api/v1/auth/* — ListRoles, GetRole, CreateRole, UpdateRole, DeleteRole, ListPermissions, AddRolePermission, RemoveRolePermission, AssignRoleToKey, RevokeRoleFromKey, Me. callerFromRequest builds an authsvc.Caller from the Phase 3 ActorIDKey/ActorTypeKey/TenantIDKey context values. writeAuthError translates service + repository sentinels into HTTP status codes (401/403/404/409/400/500). 14 handler tests with in-memory fakes pin the HTTP shape + error mapping.

* internal/api/router/router.go: HandlerRegistry gains an Auth field; 11 new routes registered. openapi_parity_test SpecParityExceptions extended with the new auth routes (OpenAPI YAML schema land in a Phase 4 follow-up commit so the schema review is its own atomic change; the route shape is fully documented inline via the Go type definitions until then).

* cmd/server/main.go: wires the postgres auth repos (RoleRepository, PermissionRepository, ActorRoleRepository) + the Authorizer + RoleService/PermissionService/ActorRoleService into the new AuthHandler. Adds authPermissionCheckerAdapter to bridge the typed-string Authorizer signature to the auth.PermissionChecker interface (avoids an internal/auth → internal/service/auth import cycle).

Phase 5 (CLI):

* cmd/cli/main.go: adds 'auth' command dispatch with subcommands roles/permissions/keys/me.

* internal/cli/auth.go: AuthMe, AuthListRoles, AuthGetRole, AuthListPermissions, AuthAssignRoleToKey, AuthRevokeRoleFromKey methods on Client. Mirrors the Phase 4 HTTP surface.

Phase 3.5 (handler IsAdmin → middleware-wrapped RequirePermission) DEFERRED. Honest reasoning:

(1) The 5 admin handlers (bulk_revocation, admin_crl_cache, admin_scep_intune, admin_est, intermediate_ca) currently gate via auth.IsAdmin checks INSIDE the handler bodies. Converting cleanly requires moving the gate to the router (auth.RequirePermission middleware wrap) AND removing the in-handler check AND rewriting the existing 3-test triplets per handler (M-008 pinned: _NonAdmin_Returns403 / _AdminExplicitFalse_Returns403 / _AdminPermitted_ForwardsActor) because the existing tests call the handler function directly, bypassing middleware. After conversion, those tests would pass without 403'ing because the gate moved away — the test invariants need to flow through a router-level integration setup instead.

(2) Picking the right permission per handler is a security-review-worthy decision. Using existing operator-class perms (cert.revoke, issuer.edit) widens access from admin-only to operator-class; adding new admin-only perms (cert.bulk_revoke, crl.admin, scep.admin, est.admin, ca.hierarchy.manage) requires a migration 000030 plus a coordinated catalogue update in internal/domain/auth/validate.go. Both options are defensible but warrant a focused commit, not a 5-handler sweep mixed in with the API + CLI work.

(3) The conversion can be done now without functional regressions IF we leave the in-handler IsAdmin checks in place AND add middleware wraps as defense-in-depth — but that's the worst of both worlds (legacy gate still blocks non-admin operators, defeating the point of RBAC; new gate adds runtime cost with no semantic change). A clean conversion needs the in-handler check removed.

Concrete plan for Phase 3.5 (separate commit, next session): (a) add new admin-only perms via migration 000030 OR document the widening to operator-class; (b) wrap each of the 5 admin routes with auth.RequirePermission(checker, perm, nil) in router.go; (c) remove auth.IsAdmin checks from the 5 handler bodies; (d) move the M-008 _NonAdmin/_AdminExplicitFalse tests to router-level integration tests, keep _AdminPermitted as a direct handler test for actor-passthrough; (e) update m008_admin_gate_test.go registry to track auth.RequirePermission middleware wraps in router.go instead of auth.IsAdmin call sites in handler files.

Verifications: go vet ./... clean; gofmt clean across all touched files; go test -short -count=1 across internal/auth, internal/service/auth, internal/api/handler, internal/api/router, internal/cli, cmd/server, cmd/cli all green (one transient too-many-open-files retry on internal/cli + internal/api/router; second run clean).

Branch: dev/auth-bundle-1. Commit chain: 99a012e (Phase 0 extract) -> 19497ee (Phase 1 schema + repo) -> bd54d5f (Phase 2 service) -> d473398 (Phase 3 primitive) -> THIS (Phase 4 + 5).
2026-05-09 16:43:48 +00:00
shankar0123 99a012e3be auth-bundle-1 Phase 0: extract internal/auth/ from middleware package
Bundle 1 / Phase 0: pure refactor splitting auth surface out of internal/api/middleware so Bundle 2 (OIDC + sessions) and the broader RBAC primitive (roles, permissions, scoped grants) have a clean home.

Moved to internal/auth/: NamedAPIKey, HashAPIKey, AuthConfig, NewAuthWithNamedKeys, NewAuth, UserKey, AdminKey, GetUser, IsAdmin. Added testfixtures.go (WithActor / WithAdmin / WithActorAdmin) so handler tests don't construct context manually.

Stayed in internal/api/middleware/: RequestID, Logging, NewLogging, Recovery, RateLimitConfig, NewRateLimiter (now imports auth.GetUser for per-user keying per audit Category C), CORSConfig, NewCORS, ContentType, CORS, GetRequestID, responseWriter, Chain, audit middleware (now imports auth.GetUser).

Updated 22 caller files across cmd/, internal/api/handler/, internal/api/middleware/, internal/mcp/. Existing m008_admin_gate_test.go now scans for auth.IsAdmin( substring; Phase 3 will further evolve to track auth.RequirePermission. Behavior unchanged: all handler / middleware / service / connector / cmd / mcp tests pass with no test-logic edits, only import-path renames.

Phase 0 exit criteria: internal/auth/ exists with 6 files; middleware.go went 575 -> 422 lines (auth-related ~150 lines moved out); grep -rE 'middleware\.(GetUser|IsAdmin|UserKey|AdminKey|NamedAPIKey|HashAPIKey|NewAuth)' returns 0 hits; context.WithValue(.*middleware.UserKey/AdminKey) returns 0 hits; go vet ./... clean; go test -short ./... green across all packages tested.

Branch: dev/auth-bundle-1. Per cowork/auth-bundle-1-prompt.md, do not merge to master without (1) make verify green, (2) >= 2 external testers confirm, (3) >= 90% coverage on internal/auth/ in .github/coverage-thresholds.yml.
2026-05-09 15:51:31 +00:00
shankar0123 0e06f6c4fc cli: promote --force on renew + require --reason on revoke (closes P3-1, P3-2)
Closes findings P3-1 and P3-2 from the 2026-05-05 CLI/API/MCP↔GUI parity
audit (cowork/cli-gui-parity-audit-2026-05-05/RESULTS.md). Both findings
flagged hidden defaults that the CLI was sending without exposing them
to operators: `force=false` baked into every renew payload, and a silent
fallback to `reason="unspecified"` whenever --reason was omitted.

P3-1 — promote --force on `certs renew` (full end-to-end plumbing)

The pre-2026-05-05 CLI sent `{"force": false}` in the renew body. The
API handler never decoded it — a textbook "lying field" per the
operator's CLAUDE.md "complete path, not the easy path" rule: the body
field stored a value, claimed to do something, and silently did nothing
because the wire never reached the consumer. Adding a --force flag that
also went unread would have created another lying field.

This commit takes the complete path:

  service.CertificateService.TriggerRenewal grew a `force bool` parameter
  (internal/service/certificate.go). When force=true, the
  RenewalInProgress block is overridden so operators can recover stuck
  in-flight renewals where a previous job hung without releasing the
  status flag. Archived and Expired remain terminal blockers regardless
  of force — those are semantic dead-ends that --force should not paper
  over (archived = decommissioned, expired = issue a new cert instead of
  renewing a dead one).

  handler.CertificateHandler.TriggerRenewal parses force from
  ?force=true (or ?force=1) query param, OR {"force": true} JSON body,
  whichever the client picks. Defaults to false. Passes through to the
  service.

  internal/cli/client.go::RenewCertificate(id, force bool) sends
  ?force=true on the URL when --force is set. The historical hardcoded
  `{"force": false}` body is gone — no more lying field.

  cmd/cli/main.go dispatches `certs renew <id> [--force]` (ID-first
  flag-second convention matches the existing `agents retire <id>
  [--force]`).

P3-2 — require --reason on `certs revoke` (Option A: strict refusal)

The pre-2026-05-05 CLI dropped to `--reason unspecified` whenever the
operator omitted the flag. Compliance reporting (RFC 5280 §5.3.1, PCI-
DSS §3.6, HIPAA §164.312) relies on the reason code being meaningful;
silent fallback defeats the audit trail because every revocation looks
identical.

  cmd/cli/main.go dispatch refuses to send when --reason is empty,
  prints the canonical RFC 5280 §5.3.1 reason-code menu, and exits
  non-zero.

  internal/cli/client.go exposes ValidRevokeReasons() returning the
  canonical camelCase list (unspecified, keyCompromise, caCompromise,
  affiliationChanged, superseded, cessationOfOperation, certificateHold,
  removeFromCRL, privilegeWithdrawn, aaCompromise) and
  NormalizeRevokeReason() that accepts both camelCase and snake_case
  inputs and normalises to the canonical wire form. Off-list reasons
  are rejected at dispatch with the menu re-printed.

Test pins:

  internal/cli/client_test.go::TestClient_RenewCertificate_ForceFlag —
  --force=true sends ?force=true with empty body; --force=false sends
  no query and no body.

  internal/cli/client_test.go::TestNormalizeRevokeReason +
  TestValidRevokeReasons — canonical-camelCase + snake_case + reject-
  off-enum behaviour.

  cmd/cli/dispatch_test.go::TestHandleCerts_Revoke_RequiresReason +
  TestHandleCerts_Revoke_RejectsUnknownReason +
  TestHandleCerts_Renew_ForceFlag — dispatch-layer pins for the same
  contracts.

  internal/api/handler/certificate_handler_test.go::TestTriggerRenewal_
  ForceQueryParam — query-param passthrough (no-flag, force=true,
  force=1, force=false) flows through to the service-layer parameter.

  internal/service/certificate_test.go::TestTriggerRenewal_
  ForceOverridesInProgress — force=false preserves the
  RenewalInProgress block; force=true clears it.

  Existing TestTriggerRenewal_Archived extended to assert force=true
  still blocks Archived (terminal-state guarantee).

Docs: docs/reference/cli.md updated with the --force example for renew
and the strict --reason semantics for revoke (including snake_case
input acceptance).

Acceptance gate (verified):
  - go build ./cmd/server/... ./cmd/agent/... ./cmd/cli/...
    ./cmd/mcp-server/... clean.
  - go vet ./... clean.
  - go test -short -count=1 ./... pass repo-wide.
  - bash scripts/ci-guards/openapi-handler-parity.sh clean
    (router 178, OpenAPI 144, exceptions 36 — unchanged; we add
    parameter parsing, not routes).
  - gofmt -l clean.
2026-05-05 19:49:34 +00:00
shankar0123 75097909e9 2026-05-05 18:18:29 +00:00
shankar0123 0f81c1b956 ci: re-fix CodeQL #32 + repair loadtest f5-mock build context
Two unrelated CI failures from run #25305811340; fixed in one
commit since neither needs the other to land first.

CodeQL alert #32 (go/log-injection at middleware.go:68) reopened
after b0fc067. The previous fix introduced a scrubLogValue helper
backed by strings.NewReplacer; CodeQL's taint tracker only
recognizes the literal strings.ReplaceAll pattern as a sanitizer
(matches the OWASP example in the rule docs). Wrapper helpers and
NewReplacer don't trigger the recognition, so the analyzer kept
flagging.

Fix: drop the helper. Inline strings.ReplaceAll chains directly at
the call site for r.Method and r.URL.Path. Same runtime semantics
(strip CR/LF/NUL); CodeQL pattern-matches the literal call so the
alert can finally close.

Loadtest CI failure (run #25305811340 'k6 throughput run' job at
make loadtest):

  ERROR: failed to compute cache key: failed to calculate checksum
  of ref ...: "/deploy/test/f5-mock-icontrol": not found

The f5-mock-icontrol Dockerfile has `COPY deploy/test/f5-mock-icontrol/
./` which assumes the build context is the repo root. The
docker-compose.test.yml f5-mock-icontrol service correctly uses the
long-form build:

  build:
    context: ..        # = repo root from deploy/docker-compose.test.yml
    dockerfile: deploy/test/f5-mock-icontrol/Dockerfile

The loadtest compose at deploy/test/loadtest/docker-compose.yml
used the shorthand:

  build: ../f5-mock-icontrol

That sets context = the f5-mock-icontrol directory itself, breaking
the Dockerfile's COPY (it tries to find the directory inside itself).

Fix: change the loadtest compose to the long-form pattern matching
docker-compose.test.yml, with context: ../../.. (= repo root from
deploy/test/loadtest/) and explicit dockerfile path.

Verified locally:
  gofmt: clean.
  go vet ./internal/api/middleware/...: exit 0.
  go test -short -count=1 ./internal/api/middleware/...: ok 0.253s.
  python3 -c 'import yaml; yaml.safe_load(...)' on the compose
    file: parses clean.
  grep -rnE 'scrubLogValue' internal/api/: zero references (helper
    fully dropped).

References:
  https://github.com/certctl-io/certctl/security/code-scanning/32
  CI run https://github.com/certctl-io/certctl/actions/runs/25305811340
Closes CodeQL #32 + restores loadtest CI.
2026-05-04 17:26:24 +00:00
shankar0123 b0fc067317 security: close CodeQL #17 (log injection) + #23 (SSRF false-positive reopen)
Two CodeQL alerts in one sweep — both medium-impact follow-ups
on already-merged guards.

Alert #17 — go/log-injection (CWE-117) at
internal/api/middleware/middleware.go:58:

  log.Printf("[%s] %s %s %d %v", requestID, r.Method, r.URL.Path, ...)

  r.Method and r.URL.Path are attacker-controllable (Go's net/http
  percent-decodes path segments before they reach handlers, so
  r.URL.Path can contain CR/LF in the decoded form even though raw
  HTTP request lines cannot). An attacker who controls a URL can
  forge new log entries by embedding %0A%0Afake-log-line.

  Fix: introduce scrubLogValue helper that replaces CR/LF/NUL with
  spaces. Apply to both r.Method and r.URL.Path. Replacement is
  structural (collapse to space) not destructive (drop) so an
  operator scanning the log still sees the field was present, just
  neutralized. Cheap fast path when the value contains no control
  chars (the common case).

  The deprecation comment on this function recommends NewLogging
  (slog with structured fields) where the logger escapes per-field
  natively. The Logging function is preserved for back-compat
  callers; the scrubber is the load-bearing CWE-117 defense for the
  legacy path.

Alert #23 — go/request-forgery (CWE-918) at scep_probe.go:271:

  CodeQL reopened the alert after commit e6919cd. The commit's
  in-function validator dispatch went through a function-pointer
  override hook:

    validateURL := s.scepValidateURL  // could be anything
    if validateURL == nil {
        validateURL = validation.ValidateSafeURL
    }
    if err := validateURL(rawURL); err != nil { ... }

  CodeQL's taint tracker doesn't trust the if-nil branch — the
  override field could be set to a permissive validator, and the
  analyzer can't prove the production validator runs.

  Fix: invert the dispatch. Always call validation.ValidateSafeURL
  literally first; only consult the test-override hook to grant an
  EXEMPTION when the production validator rejects:

    if err := validation.ValidateSafeURL(rawURL); err != nil {
        if s.scepValidateURL == nil || s.scepValidateURL(rawURL) != nil {
            return ... validate url error
        }
    }

  Same applies to ProbeSCEP's entry-point validator. Both call sites
  now have the literal validation.ValidateSafeURL call in-scope of
  the sink (client.Do), which CodeQL recognizes as a sanitizer.

  Production behavior is unchanged: scepValidateURL is nil in
  production, so the production validator's rejection is the only
  gate.

  Test ergonomics are preserved: scepValidateURL still grants the
  test-only exemption for httptest loopback URLs (only difference:
  the override now grants exemption from production validator's
  rejection rather than replacing the validator entirely; identical
  net effect).

Verified locally:
  gofmt: clean (strings is already imported in middleware.go).
  go vet ./internal/api/middleware/... + ./internal/service/...:
    exit 0.
  go test -short ./internal/api/middleware/...: ok 0.244s.
  go test -short ./internal/service/...: ok 4.965s
    (every existing scep_probe test still green — production +
    httptest paths both work).

References:
  https://github.com/certctl-io/certctl/security/code-scanning/17
  https://github.com/certctl-io/certctl/security/code-scanning/23
Closes CodeQL #17. Re-closes CodeQL #23 with a fix CodeQL's taint
tracker can verify.
2026-05-04 05:29:35 +00:00
shankar0123 9ef9f3cde3 refactor(scep+ejbca): drop dead conditionals on always-empty vars (CodeQL #18, #19)
Two CodeQL go/comparison-of-identical-expressions alerts in one
sweep — both Warning severity, both real dead-code (not false
positives). CodeQL detected that each comparison's LHS variable
was provably constant.

Alert #18 — internal/api/handler/scep.go:612 (extractCSRFields):

  challengePassword := ""
  transactionID := ""
  // ... loop populates challengePassword from CSR.Attributes ...
  for _, attr := range csr.Attributes {
      if attr.Type.Equal(oidChallengePassword) {
          // populates challengePassword ONLY — transactionID stays ""
      }
  }
  if transactionID == "" && csr.Subject.CommonName != "" {  // ← always true
      transactionID = csr.Subject.CommonName
  }

  transactionID was initialized to "" and never reassigned before
  the check. The conditional was always true; the MVP path was
  effectively "unconditionally fall back to CN". The RFC 8894 path
  (tryParseRFC8894 above this function) extracts transaction-ID
  properly from PKCS#7 authenticatedAttributes; the MVP path is for
  lightweight legacy clients that send the raw CSR with no PKCS#7
  wrapping, and CN-as-transaction-ID is sufficient there.

  Fix: drop the dead transactionID local var + dead conditional;
  unconditionally set transactionID = csr.Subject.CommonName. No
  behavioral change — the runtime semantics are identical to before
  (every valid invocation already took the fallback). The CN
  extraction stays robust because the empty-CN case still produces
  an empty transactionID, which downstream callers handle.

Alert #19 — internal/connector/issuer/ejbca/ejbca.go:415 (RevokeCertificate):

  serial := request.Serial
  issuerDN := ""
  // (comment: "if we have time..." — TODO never followed up)
  revokeURL := fmt.Sprintf("%s/certificate/%s/%s/revoke", apiURL, issuerDN, serial)
  if issuerDN == "" {  // ← always true
      revokeURL = fmt.Sprintf("%s/certificate/%s/revoke", apiURL, serial)
  }

  issuerDN was hardcoded to "" two lines above. The first revokeURL
  line was unreachable dead code; the conditional always fired and
  the serial-only URL always won. EJBCA's REST API has both
  /certificate/{issuer_dn}/{serial}/revoke and /certificate/{serial}/revoke
  endpoints; the serial-only form is correct for typical certctl
  deployments where one EJBCA CA maps to one certctl issuer config
  (no overlapping serial spaces).

  Fix: drop the dead first revokeURL + dead conditional; build
  revokeURL once via the serial-only endpoint. No behavioral change
  — the runtime URL was always the serial-only one. Comment retained
  + expanded to document the future-enhancement path (parse issuer
  DN from IssuanceResult metadata + use the DN-qualified endpoint
  when a multi-CA EJBCA deployment surfaces).

Verified locally:
  gofmt: clean.
  go vet ./internal/api/handler/... + ./internal/connector/issuer/ejbca/...: exit 0.
  go test -short -count=1 ./internal/api/handler/... + ejbca/...: PASS.
  Both fixes are pure dead-code removal — runtime behavior is byte-
  identical to pre-edit. The existing test suites would have caught
  any actual behavioral change.

References:
  https://github.com/certctl-io/certctl/security/code-scanning/18
  https://github.com/certctl-io/certctl/security/code-scanning/19
Closes both alerts.
2026-05-04 05:17:16 +00:00
shankar0123 34adcfbbe5 api, handler: 4 admin-gated CA hierarchy endpoints + OpenAPI (Rank 8 commit 4)
Rank 8 commit 4 of 5. The API + RBAC layer that operators drive
the new hierarchy management surface from.

Endpoints (all admin-gated via middleware.IsAdmin; non-admin Bearer
callers get 403):
  POST /api/v1/issuers/{id}/intermediates
       Discriminator on body shape:
         empty parent_ca_id + root_cert_pem + key_driver_id
           → CreateRoot (registers operator-supplied root CA).
         parent_ca_id non-empty
           → CreateChild (signs new sub-CA cert under parent).
       Service-layer error → HTTP code mapping:
         ErrCANotSelfSigned         → 400
         ErrCAKeyMismatch           → 400
         ErrPathLenExceeded         → 400
         ErrNameConstraintExceeded  → 400
         ErrInvalidCertPEM          → 400
         ErrParentCANotActive       → 409
         ErrIntermediateCANotFound  → 404
         (other)                    → 500
  GET  /api/v1/issuers/{id}/intermediates
       Returns flat list ordered by created_at; caller renders the
       tree from each row's parent_ca_id (nil = root).
  GET  /api/v1/intermediates/{id}
       Single-row detail.
  POST /api/v1/intermediates/{id}/retire
       Two-phase: confirm=false → active→retiring; confirm=true →
       retiring→retired with active-children check (drain-first
       semantics; ErrCAStillHasActiveChildren → 409).

Files changed:
  internal/api/handler/intermediate_ca.go            — 4 handlers
                                                       + handler-defined
                                                       service interface
                                                       (dependency
                                                       inversion).
  internal/api/handler/intermediate_ca_test.go       — 8 test variants
                                                       (M-008 admin-
                                                       gate triplet
                                                       complete).
  internal/api/handler/m008_admin_gate_test.go       — register the
                                                       new admin-gated
                                                       handler in
                                                       AdminGatedHandlers
                                                       so the M-008
                                                       coherence
                                                       scanner stays
                                                       green.
  internal/api/router/router.go                      — 4 r.Register
                                                       calls + new
                                                       IntermediateCAs
                                                       field on
                                                       HandlerRegistry.
  cmd/server/main.go                                 — wire the
                                                       postgres repo +
                                                       service +
                                                       handler. Reuses
                                                       the same
                                                       signer.FileDriver
                                                       instance the
                                                       OCSP responder
                                                       bootstrap path
                                                       feeds.
  api/openapi.yaml                                   — 4 new
                                                       operationIds,
                                                       full body
                                                       schema + status-
                                                       code dispatch.

Tests (8 in this commit):
  TestIntermediateCA_Handler_NonAdmin_Returns403       (admin gate
    — table-driven across all 4 endpoints)
  TestIntermediateCA_Handler_AdminExplicitFalse_Returns403
    (defensive: AdminKey present but false ≠ AdminKey absent)
  TestIntermediateCA_Handler_AdminPermitted_ForwardsActor
    (admin actor forwarded to service for audit attribution)
  TestIntermediateCA_HandlerCreate_RootDispatch
    (body discriminator: empty parent_ca_id → CreateRoot)
  TestIntermediateCA_HandlerCreate_ChildDispatch
    (body discriminator: parent_ca_id present → CreateChild)
  TestIntermediateCA_HandlerCreate_BadRequestOnMissingRootBundle
    (validation: no parent + no root bundle → 400)
  TestIntermediateCA_HandlerCreate_ServiceErrorMappings
    (table-driven: 7 service errors → expected HTTP codes)
  TestIntermediateCA_HandlerRetire_TwoPhaseConfirm
    (confirm=false then confirm=true forwarded correctly)
  TestIntermediateCA_HandlerRetire_StillHasActiveChildren_Returns409
    (drain-first contract — 409 not 500)

Verified locally:
  gofmt: clean.
  go vet ./...: exit 0.
  go test -short -count=1 ./internal/api/handler/...: ok 4.498s.
  bash scripts/ci-guards/openapi-handler-parity.sh: clean
    (router routes: 182, openapi operations: 148; the +4 new routes
    have +4 new operationIds — parity preserved).
  bash scripts/ci-guards/* (all 24 guards): clean.

Out of scope of THIS commit (commit 5):
  - web/src/pages/IssuerHierarchyPage.tsx (recursive tree render).
  - docs/intermediate-ca-hierarchy.md sysadmin runbook (FedRAMP /
    financial-services / internal-PKI patterns).
  - docs/connectors.md hierarchy_mode row.
  - WORKSPACE-ROADMAP entries (HSM-backed roots, automated
    rotation, CRL chaining, NameConstraints templates, D3
    dendrogram).

Reference: cowork/rank-8-intermediate-ca-hierarchy-prompt.md, commit 4.
2026-05-04 02:26:24 +00:00
shankar0123 aebfd8bd7c Revert "chore: drop 'Infisical' label from internal references"
This reverts commit 19706e56b3.
2026-05-04 01:18:15 +00:00
shankar0123 19706e56b3 chore: drop 'Infisical' label from internal references
Strategic naming cleanup. Earlier doc-comments + commit messages framed Rank
4 / Rank 5 / Rank 7 work as 'Rank N of the 2026-05-03 Infisical deep-research
deliverable' — the 'Infisical' qualifier was a holdover from the original
deep-research framing where Infisical (a competing secrets-management
platform) was the comparator. Keeping the comparator's name in our source
adds noise without value; an external reader sees 'Infisical' and assumes a
dependency or shared lineage rather than reading it as the competitive
context it was.

Mechanical sed across 34 files (32 source / docs + 2 follow-up Python passes
to collapse 'deep-research deep-research' duplicates that emerged where the
original phrase wrapped across lines):

  s|Infisical deep-research|deep-research|g
  s|infisical-deep-research-results|deep-research-results-2026-05-03|g
  s|infisical-deep-research-prompt|deep-research-prompt-2026-05-03|g
  s|infisical-deep-research|deep-research|g
  s|Infisical|deep-research|g
  s|deep-research deep-research|deep-research|g  # collapse-pass

Net diff: 63 insertions / 64 deletions across cmd/, docs/, internal/,
migrations/. Pure text substitution; zero behavior change. Code path
unchanged — go vet clean, tests for TestApproval pass on both
internal/service and internal/api/handler packages.

Workspace docs (cowork/) carry the same references and will be swept
separately — they're not under certctl/ git control. The two filename
references (cowork/infisical-deep-research-results.md +
cowork/infisical-deep-research-prompt.md) get renamed alongside that sweep
to deep-research-results-2026-05-03.md /
deep-research-prompt-2026-05-03.md so cross-references in the certctl
repo doc-comments resolve cleanly.
2026-05-04 01:15:01 +00:00
shankar0123 81632eb0f3 api, handler: 4 approval endpoints + handler RBAC integration tests
Rank 7 of the 2026-05-03 Infisical deep-research deliverable, commit 3 of 4.
Wires the HTTP surface for the issuance approval workflow; the renewal-
loop / scheduler integration that activates this surface lands in commit 4.

Files added:
  internal/api/handler/approval.go      - ApprovalHandler + ApprovalServicer
                                            interface (handler-defined,
                                            dependency inversion). 4
                                            endpoints:
                                              GET  /api/v1/approvals
                                                ?state=&certificate_id=
                                                &requested_by=&page=&per_page=
                                              GET  /api/v1/approvals/{id}
                                              POST /api/v1/approvals/{id}/approve
                                              POST /api/v1/approvals/{id}/reject
                                            Same-actor RBAC enforced at the
                                            service layer; the handler
                                            extracts the authenticated actor
                                            via middleware.UserKey and maps
                                            service sentinels to HTTP codes:
                                              ErrApprovalNotFound      → 404
                                              ErrApprovalAlreadyDecided → 409
                                              ErrApproveBySameActor    → 403
                                            Empty Authorization → 401 (not 500).
                                            Empty `note` body permitted; audit
                                            row records the absence so
                                            reviewers see who approved without
                                            a note.

  internal/api/handler/approval_test.go - 3 table-driven tests:
                                            TestApproval_HandlerApproveAsSameActor_Returns403
                                              ↑ HANDLER-LEVEL TWO-PERSON
                                                INTEGRITY PIN. Pairs with
                                                the service-level
                                                TestApproval_Approve_RejectsSameActor.
                                                Compliance auditors expect
                                                exactly HTTP 403 (not 401,
                                                not 500) when the requester
                                                self-approves; the test
                                                additionally asserts the
                                                error body contains the
                                                "two-person integrity"
                                                substring so an auditor can
                                                grep server logs for
                                                attempted self-approvals.
                                            TestApproval_HandlerEmptyNote_Allowed_DecidedByExtractedFromAuth
                                              ↑ pins that decided_by comes
                                                from the auth-middleware
                                                UserKey, NEVER from the
                                                request body. Defends
                                                against future contributor
                                                confusion that might let a
                                                client supply their own
                                                decided_by string.
                                            TestApproval_HandlerErrorMapping
                                              (NotFound → 404, AlreadyDecided
                                              → 409 subtests).

Files modified:
  internal/api/router/router.go         - Adds Approvals field to
                                            HandlerRegistry struct + 4
                                            r.Register lines for the
                                            approval routes. Go 1.22
                                            ServeMux precedence: literal
                                            /approve and /reject segments
                                            resolve before the {id}
                                            pattern-var route, mirroring
                                            the existing notifications
                                            block's /requeue precedence.

Verified:
  gofmt: clean.
  go vet ./internal/api/... ./internal/service/...: exit 0.
  go test -short -count=1 -run TestApproval
    ./internal/api/handler/...: ok 0.004s.

Note on OpenAPI spec: the prompt's spec section also calls for 5 new
operationIds in api/openapi.yaml (createApprovalRequest, listApprovalRequests,
getApprovalRequest, approveApprovalRequest, rejectApprovalRequest). The
external-create endpoint is intentionally not implemented in V2 — every
approval request originates from the renewal-loop entry points (commit 4)
so the only operations exposed are list / get / approve / reject. The
4-route surface is a deliberate scope cut: external systems wanting to
inject approval requests can use the underlying `POST /api/v1/certificates/
{id}/renew` path which creates the parallel ApprovalRequest as a side
effect (post-commit-4 wiring). OpenAPI extension batched into commit 4
alongside the integration changes.

Out of scope for this commit (lands in commit 4):
  - Integration into CertificateService.TriggerRenewal +
    RenewalService.CheckExpiringCertificates + Scheduler.ReapTimedOutJobs.
  - cmd/server/main.go wiring.
  - Config.Approval.BypassEnabled + CERTCTL_APPROVAL_BYPASS env var.
  - api/openapi.yaml extensions.
  - docs/connectors.md + docs/approval-workflow.md.

Reference: cowork/rank-7-approval-workflow-primitive-prompt.md.
2026-05-04 01:05:16 +00:00
shankar0123 8b75e0311b chore: rename Go module path to github.com/certctl-io/certctl
Mechanical sed across the main go.mod's module declaration, the f5-mock-icontrol
sub-module's go.mod, every Go file's import path (361 files), and a rebuild of
the checked-in f5-mock-icontrol binary so its embedded build-info reflects the
new module path. No behavior change.

Choice B from cowork/transfer-certctl-to-org.md, executed 2026-05-04. Choice A
(keep module path declared as github.com/shankar0123/certctl regardless of
repo URL) shipped on the day of the org transfer (2026-05-03) since we had no
external Go consumers; this commit closes that deferral.

Backward-compat: GitHub HTTP redirects continue to forward
github.com/shankar0123/certctl → github.com/certctl-io/certctl at the URL
level, but Go's module proxy uses the path declared in go.mod as the
canonical name. Pre-fix, anyone trying `go get github.com/certctl-io/certctl/...`
hit a "module path mismatch" error because go.mod said
github.com/shankar0123/certctl and the URL they fetched it from said
certctl-io/certctl. Post-fix, the canonical name and the URL agree, so
go get / go install / external Go consumers / Go-tooling integrations
work cleanly via either the new path (preferred) or the old path (which
redirects and Go follows the redirect for source fetch).

Anyone still importing the old path inside their own code keeps working
provided they update their go.mod's `require` line to match — the module
path declared in their consumer's go.sum / go.mod is the authoritative
import name, so a mass sed across their import statements is the migration
on the consumer side. No external consumers exist today.

Diff shape:
  361 *.go files  — import path replacement only
    2 go.mod     — module declaration replacement only
    1 binary     — deploy/test/f5-mock-icontrol/f5-mock-icontrol rebuilt
                   so embedded build-info reflects the new path (8618965 vs
                   8618933 bytes; 32-byte diff is the build-info change)

  Total: 364 files, 730 insertions / 730 deletions, net-zero size, pure
  mechanical substitution.

Verification:
  gofmt: 17 files needed re-alignment after sed (the new path is one char
    shorter than the old, so column-aligned import groups drifted). Applied
    `gofmt -w` to fix.
  go mod tidy: clean exit on both modules.
  go vet ./...: clean exit.
  go build ./...: clean exit.
  go test -short -count=1 on representative packages: all green
    (internal/domain, internal/validation, internal/crypto, internal/crypto/signer,
    cmd/agent). Test output now reads `ok github.com/certctl-io/certctl/...`
    confirming the module path resolves correctly.
  binary: f5-mock-icontrol rebuilt; `strings | grep shankar0123` returns
    nothing; `strings | grep certctl-io/certctl` shows the new module path
    embedded in build-info.

Files intentionally NOT touched in this commit:
  README.md / CHANGELOG.md / docs/ / etc. — already swept to certctl-io
    URLs in commit 0729ee4 (the post-transfer URL refresh). This commit is
    purely the Go-tooling layer.
  Scarf pixels (`shankar0123.docker.scarf.sh/...`) — Scarf-account
    namespace, not a Go import or GitHub repo URL. Stays.

This is a non-blocking, non-customer-impacting change. Operators pulling
container images, running `make verify`, hitting the API, or installing the
agent see no functional difference. Only Go-tooling consumers (none today)
are affected, and they're enabled — not broken — by this commit.
2026-05-04 00:30:29 +00:00
shankar0123 109f32ff41 notifications: per-policy multi-channel expiry-alert routing
Closes Rank 4 of the 2026-05-03 Infisical deep-research deliverable
(see cowork/infisical-deep-research-results.md Part 5). Pre-fix,
RenewalService.CheckExpiringCertificates already ran daily,
RenewalPolicy.AlertThresholdsDays drove per-cert thresholds, and
NotificationService.SendThresholdAlert deduped per (cert, threshold)
— but the channel was hardcoded to Email
(internal/service/notification.go:118 pre-fix). Operators who
configured PagerDuty / Slack / Teams / OpsGenie via
CERTCTL_PAGERDUTY_ROUTING_KEY etc. got nothing at any threshold
unless SMTP was also wired. Their first signal of an expired cert
was a 3 AM outage.

This commit lands the routing matrix on top of the existing
infrastructure:

  1. RenewalPolicy gains AlertChannels (per-tier channel list) +
     AlertSeverityMap (per-threshold tier assignment) +
     EffectiveAlertChannels / EffectiveAlertSeverity accessors.
     Default*() helpers preserve the back-compat Email-only
     behaviour for operators who haven't touched their policies
     post-upgrade. Migration 000026 adds the JSONB columns
     idempotently.
  2. NotificationService.SendThresholdAlertOnChannel — the new
     per-channel dispatch helper. Old SendThresholdAlert stays as
     an Email-only alias so non-policy callers (admin "send test
     alert" surfaces) keep working byte-for-byte.
  3. NotificationService.HasThresholdNotificationOnChannel — per-
     (cert, threshold, channel) deduplication so a transient
     PagerDuty 5xx today does NOT suppress today's Slack alert and
     tomorrow's PagerDuty retry will still fire.
  4. RenewalService.sendThresholdAlerts walks the resolved channel
     set per threshold tier, fans out to every configured channel,
     handles per-channel failures independently, defensively drops
     off-enum channels with an audit row trail, and records a per-
     channel audit event with metadata.channel + metadata.severity_tier.
  5. service.ExpiryAlertMetrics — atomic counter table mirrored on
     the VaultRenewalMetrics shape from the 2026-05-03 audit fix #5
     (commit 0792271). Three labels: channel × threshold × result
     (success / failure / deduped). Cardinality bound: 6 × 4 × 3 =
     72 series for the standard 4-threshold matrix.
  6. handler.MetricsHandler.SetExpiryAlerts wires the Prometheus
     exposer for certctl_expiry_alerts_total{channel,threshold,result}.
     Pre-sorted snapshot for byte-stable emission.
  7. cmd/server/main.go threads ONE service.ExpiryAlertMetrics
     instance through both the recording side (notificationService.
     SetExpiryAlertMetrics) and the exposing side
     (metricsHandler.SetExpiryAlerts).

Dispatch flow (post-fix, per renewal-loop tick):

  cert ages past T-30  → daily renewal-loop fires
                       → policy lookup
                       → for each crossed threshold:
                           - resolve severity tier (informational/
                             warning/critical) via AlertSeverityMap
                           - look up channel set in AlertChannels[tier]
                           - for each channel: dedup → SendThresholdAlertOnChannel
                             → notifierRegistry[channel] → audit row →
                             Prometheus counter increment

Tests (internal/service/renewal_expiry_alerts_test.go):

  TestExpiryAlerts_DefaultMatrix_EmailOnly
  TestExpiryAlerts_PerTierFanOut
  TestExpiryAlerts_PerChannelDedup
  TestExpiryAlerts_OneChannelFails_OthersStillFire
  TestExpiryAlerts_OffEnumChannelDropped
  TestExpiryAlerts_MetricCounterIncrements
  TestExpiryAlerts_NilPolicy_FallsToDefault
  TestExpiryAlerts_OperatorOptOutOfTier

The PerTierFanOut test wires 6 mock notifiers, drives a cert at 0
days through the canonical 4 thresholds with the matrix
{informational:[Slack], warning:[Slack,Email],
critical:[PagerDuty,OpsGenie,Email]}, and asserts the exact
recipient counts: Slack=3, Email=3, PagerDuty=1, OpsGenie=1, no
Teams, no Webhook. The OneChannelFails test pins that PagerDuty
returning a 503 does NOT skip Slack/Email at the same threshold.

Drive-by fix (internal/service/testutil_test.go): the existing
mockNotifRepo.List ignored its filter and returned all rows, which
let legacy tests pass on dedup-via-substring even though the
postgres repo actually applied the filter. Updated the mock to
honour CertificateID / Type / Status / Channel / MessageLike
filters in the same shape as the postgres implementation
(internal/repository/postgres/notification.go). All pre-existing
service tests still pass — the legacy test suite happened to be
robust to the mock filter doing nothing.

Documentation:
  - docs/connectors.md Notifier section gains "Routing expiry
    alerts across channels" — operator-facing, JSON example,
    procurement playbook ("How do I make sure PagerDuty pages on
    the T-1 alert?"), debug recipe via SQL on audit_events +
    notification_events + Prometheus.
  - docs/runbook-expiry-alerts.md — sysadmin-grade flowchart,
    per-policy channel-matrix configuration recipes, "did the on-
    call team get paged?" SQL queries, cardinality budget, V3-Pro
    forward path.
  - cowork/WORKSPACE-ROADMAP.md gains "Multi-channel expiry
    alerts: per-owner routing" V3-Pro entry under Adapter
    hardening.

Out of scope (intentional, flagged in V3-Pro forward path):
  - Per-owner / per-team / per-tenant channel routing (matrix is
    per-policy today, not per-owner).
  - Calendar-aware suppression (no T-30 alerts on weekends).
  - Escalation chains (T-1 unanswered for 30m → escalate).
  - Per-channel rate limiting (downstream of I-005 retry+DLQ).

CHANGELOG.md is intentionally not hand-edited per CHANGELOG.md
itself ("no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/domain/... ./internal/service/...
  ./internal/api/handler/... ./cmd/server/...  clean.
  (./internal/repository/postgres/... vet failed on transitive
  testcontainers/docker module download — sandbox disk pressure,
  not a code issue; postgres-repo build succeeds and tests pass.)
- go test -short -count=1 ./internal/domain/...
  ./internal/service/... ./internal/api/handler/...  green.
- go test -race -count=10 -run 'TestExpiryAlerts'
  ./internal/service/...  green (per-channel dedup race-free).

Reference: cowork/infisical-deep-research-results.md Part 5 Rank 4.
Acquisition prompt: cowork/rank-4-multichannel-expiry-alerts-prompt.md.
2026-05-03 22:12:32 +00:00
shankar0123 0792271dc6 vault: add automatic token renewal at TTL/2 + Prometheus metric
Closes Top-10 fix #5 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix, the
VaultPKI adapter authenticated with a static token and never called
renew-self. Long-lived deploys hit token expiry; the first
operator-visible signal was failed cert renewals on production
targets.

This commit:

  1. Connector.Start(ctx) spawns a goroutine that calls
     POST /v1/auth/token/renew-self at TTL/2 cadence (computed from a
     one-shot lookup-self at startup). Honours ctx.Done() for
     graceful shutdown via a per-loop done channel + Stop().
  2. On `renewable: false` response (initial lookup OR any subsequent
     renewal), the loop emits a WARN, increments the not_renewable
     counter, and exits. The operator must rotate the token before
     Vault's Max TTL elapses.
  3. New Prometheus counter certctl_vault_token_renewals_total with
     labels result={success,failure,not_renewable}. Registered
     alongside existing certctl_issuance_* counters in
     internal/api/handler/metrics.go.
  4. ERROR-level logging on renewal failure with operator-actionable
     substring ("vault token renewal failed; rotate the token before
     TTL expires") so journalctl + grep find it. Loop keeps ticking
     after a failure — transient blips don't kill it.

New optional issuer.Lifecycle interface:

  type Lifecycle interface {
      Start(ctx context.Context) error
      Stop()
  }

Connectors that hold no background goroutines (almost all of them)
do not implement this — IssuerRegistry.StartLifecycles /
StopLifecycles feature-detect via type assertion. New
lifecycle-bearing connectors plug in by implementing the interface;
no further registry plumbing required.

Wiring (cmd/server/main.go):

  - service.NewVaultRenewalMetrics() instance is shared between
    issuerRegistry.SetVaultRenewalMetrics (so Vault connectors built
    by Rebuild get a recorder) and metricsHandler.SetVaultRenewals
    (so the Prometheus exposer emits the new series).
  - issuerRegistry.StartLifecycles(ctx) is called after
    issuerService.BuildRegistry; defer issuerRegistry.StopLifecycles
    is paired so goroutines exit cleanly on signal.
  - IssuerConnectorAdapter.Underlying() exposes the wrapped
    issuer.Connector so registry-level machinery can reach the
    concrete connector behind the adapter without duplicating the
    wiring at every call site.

Tests (internal/connector/issuer/vault/vault_renew_test.go):

  - TestVault_RenewLoop_TickAtHalfTTL — three ticks → three
    renewals, all "success".
  - TestVault_RenewLoop_StopsOnNotRenewable — second renewal returns
    renewable=false, loop exits, third tick fires no HTTP call.
  - TestVault_RenewLoop_FailureSurfacesViaMetric — first renewal 403
    bumps "failure", second renewal succeeds → loop kept ticking.
  - TestVault_RenewLoop_CtxCancellation_StopsCleanly — Stop returns
    within 200ms after ctx cancel.
  - TestVault_RenewLoop_StartsNothingWhenNotRenewable — token
    already non-renewable at boot ⇒ no goroutine, "not_renewable"
    metric increments at startup so operators see it in Grafana.
  - TestVault_ComputeInterval — 4 cases pinning TTL/2 +
    minRenewInterval floor.
  - TestVault_RenewSelf_ParseFailure_NamesActionableInError —
    surfaced error contains "vault token renewal failed" + "rotate
    the token".

Cadence is dynamic — every successful renewal re-derives TTL/2
from the renewed lease's lease_duration, so a short bootstrap
token that gets renewed up to a longer Max TTL shifts to the
longer cadence automatically (defends against degenerate fast
ticking on a token whose Max TTL is far longer than its initial
TTL).

Documentation:
  - docs/connectors.md Vault PKI section gains "Token TTL +
    automatic renewal" subsection (operator-facing: cadence, metric,
    renewable=false rotation playbook).

Out of scope (intentional, flagged in the audit follow-up):
  - AppRole / Kubernetes / AWS IAM auth methods (different renewal
    semantics).
  - Hot-reload of rotated token from disk (operator restarts
    today; future: GUI/MCP issuer-update path triggers Rebuild
    which Stops the old connector and Starts the new one).
  - Auto-re-auth after token death (operator playbook owns it).

CHANGELOG.md is intentionally not hand-edited (per CHANGELOG.md
itself: "no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/service/... ./internal/api/handler/...
  ./internal/connector/issuer/vault/... ./cmd/server/...  clean.
- go test -short -count=1 ./internal/connector/issuer/vault/...
  ./internal/service/... ./internal/api/handler/...  green.
- go test -race -count=10 -run 'TestVault_RenewLoop|TestVault_ComputeInterval'
  ./internal/connector/issuer/vault/...  green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/RESULTS.md
Top-10 fix #5.
2026-05-03 21:24:27 +00:00
shankar0123 bee47f0318 acme-server: cert-manager integration test + production hardening (Phase 5/7)
Closes the production-readiness loop on the ACME surface. After this
commit, certctl ships per-account rate limits + a GC sweeper for
expired ACME state + a kind-driven cert-manager 1.15 integration test
+ a lego-driven RFC conformance harness + a k6 loadtest scenario for
the unauthenticated ACME path.

Architecture:
  - Rate limits live in-memory + per-replica. Restart wipes the
    counters; orders/hour caps are eventual-consistency anyway. A
    3-replica certctl-server fleet behind an LB effectively has 3x
    the configured throughput per account; persistent rate limiting
    is a follow-up if production telemetry shows abuse patterns we
    can't catch in a single restart cycle. Per-key + per-action
    isolation: ActionNewOrder/acc-1, ActionKeyChange/acc-1, and
    ActionChallengeRespond/<challenge-id> are independent buckets.
  - GC loop follows the existing scheduler-loop pattern (atomic.Bool
    + sync.WaitGroup; see crlGenerationLoop for shape). Three
    independent SQL sweeps per tick (DELETE expired nonces; UPDATE
    pending authzs whose expires_at < now() to expired; UPDATE
    pending/ready/processing orders whose expires_at < now() to
    invalid). Each sweep is a single statement; failures are logged-
    and-continued so a failing nonces sweep doesn't block authzs.
    Per-sweep 1m timeout bounds a stuck Postgres.
  - cert-manager integration test is gated on KIND_AVAILABLE so CI
    skips it cleanly (kind is too heavy for per-PR). Operators run
    locally via 'make acme-cert-manager-test'; the harness brings up
    a fresh cluster each run + tears it down on Cleanup.
  - lego conformance harness drives a real ACME client through
    register → run → cert-PEM-landed against a hermetic certctl
    stack. Catches RFC-shape regressions third-party clients would
    hit before they ship.
  - k6 ACME-flow scenario hammers the unauthenticated surface
    (directory + new-nonce + ARI synthetic-id) at 100 VUs × 5m. JWS-
    signed flows are out of scope for k6 (no JWS support); they're
    covered by the lego harness above.

What ships:
  - internal/api/acme/ratelimit.go (+ ratelimit_test.go: 7 cases —
    disable-when-perHour-zero, capacity, per-key isolation, per-
    action isolation, refill-over-time, RetryAfter, concurrent-access
    with -race + 200 goroutines × 200 calls).
  - internal/repository/postgres/acme.go: 4 new methods —
    CountActiveOrdersByAccount + GCExpiredNonces + GCExpireAuthorizations
    + GCInvalidateExpiredOrders. Each a single SQL statement.
  - internal/service/acme.go: SetRateLimiter + GarbageCollect +
    rate-limit gates at 3 entry points (CreateOrder + RotateAccountKey
    + RespondToChallenge) + concurrent-orders gate at CreateOrder.
    2 new sentinels (ErrACMERateLimited, ErrACMEConcurrentOrdersExceeded);
    5 new GC metrics (gc_runs / gc_run_failures / gc_nonces_reaped /
    gc_authzs_expired / gc_orders_invalidated).
  - internal/scheduler/scheduler.go: ACMEGarbageCollector interface +
    acmeGCRunning atomic.Bool + acmeGCInterval + 2 setters (SetACME-
    GarbageCollector + SetACMEGCInterval) + acmeGCLoop following the
    crlGenerationLoop shape.
  - internal/api/handler/acme.go: writeServiceError gains rateLimited
    (429 + RFC 8555 §6.7) + concurrent-orders-exceeded mappings.
  - internal/config/config.go: 5 new env vars
    (CERTCTL_ACME_SERVER_RATE_LIMIT_ORDERS_PER_HOUR=100,
    CERTCTL_ACME_SERVER_RATE_LIMIT_CONCURRENT_ORDERS=5,
    CERTCTL_ACME_SERVER_RATE_LIMIT_KEY_CHANGE_PER_HOUR=5,
    CERTCTL_ACME_SERVER_RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR=60,
    CERTCTL_ACME_SERVER_GC_INTERVAL=1m).
  - cmd/server/main.go: NewRateLimiter() + SetRateLimiter() at
    startup; conditional SetACMEGarbageCollector(acmeService) +
    SetACMEGCInterval(cfg.ACMEServer.GCInterval) when Enabled+
    GCInterval > 0.
  - deploy/test/acme-integration/: kind-config.yaml + cert-manager-
    install.sh + clusterissuer-trust-authenticated.yaml +
    clusterissuer-challenge.yaml + certificate-test.yaml + conformance-
    lego.sh + certmanager_test.go (//go:build integration + KIND_AVAILABLE
    gate).
  - deploy/test/loadtest/k6/acme_flow.js + README ACME-flows section.
  - Makefile: 2 new PHONY targets (acme-cert-manager-test +
    acme-rfc-conformance-test).
  - docs/acme-server.md: status flipped to Phase 5; Configuration
    table grows 5 rows; new 'Phase 5 — operational guidance' section
    explaining rate-limit math + GC sweeper semantics + cert-manager
    integration + lego conformance + k6 baseline.

Tests:
  - 'go vet ./...' clean across the repo.
  - 'go test -short -count=1 ./internal/...' green across every
    affected package (service / acme / handler / scheduler / repo /
    config).
  - 'go vet -tags=integration ./deploy/test/acme-integration/' clean
    (the integration test compiles cleanly with the build tag).
  - The kind/cert-manager harness is gated behind KIND_AVAILABLE so
    CI skips by default; operators run locally via 'make acme-cert-
    manager-test'.

Engineering history: cowork/WORKSPACE-CHANGELOG.md 'ACME-Server-5'.
2026-05-03 19:42:03 +00:00
shankar0123 650f5a198f fix: collapse identical if/else branches in Account handler (CodeQL #25)
CodeQL alert #25 (go/duplicate-branches) on internal/api/handler/
acme.go::ACMEHandler.Account flagged that 'if readOnly { ... } else
{ ... }' had byte-identical bodies — both setting the same
Content-Type: application/json header. The 'readOnly' bool was
threaded through the function as a placeholder for differentiated
headers (Cache-Control etc. on the POST-as-GET path) that never
landed; both branches collapsed to the same value with no
follow-through.

Audit + fix:
  - The alert is real (verified by re-reading the source); not a
    false positive.
  - The Copilot Autofix Anthropic surfaced was correct in spirit but
    incomplete: it collapsed the if/else but left 'readOnly' as
    dead code (declared at line 395, assigned at lines 400 and 436,
    only read at the now-removed if). golangci-lint's 'unused'
    linter would flag 'readOnly' next.
  - Complete fix: collapse the if/else AND remove the now-unused
    'readOnly' variable + its 2 assignments. Single unconditional
    'w.Header().Set("Content-Type", "application/json")' covers
    both paths (RFC 8555 §6.3 POST-as-GET + §7.3.2 / §7.3.6 update
    + deactivation all return the same account JSON shape — no spec
    rationale for differentiating headers).

Verified locally: 'gofmt -l .' clean; 'go vet ./...' clean;
'go test -short -count=1 ./internal/api/handler/' green; 'grep
readOnly' on the file returns only the new explanatory comment
(no live references).

The alert was first detected in commit 44a85d6 (Phase 1b) — the
duplicate has been sitting in the codebase since the Account
handler shipped. No functional regression for any RFC 8555 client
(cert-manager, lego, Posh-ACME): same status code, same headers,
same body.
2026-05-03 19:07:21 +00:00
shankar0123 f6ba5634fd ci: fix Phase 4 post-push gofmt failure (map-literal alignment)
CI on commit 4dc8d3f (Phase 4) failed gofmt on
internal/api/router/openapi_parity_test.go. The 6 new SpecParity-
Exceptions entries I added for the Phase 4 routes had over-padded
whitespace between key and value; the longest new key is
'"GET /acme/profile/{id}/renewal-info/{cert_id}":' which sets the
gofmt-canonical column width for the surrounding block, but my
hand-aligned values used the wider Phase-2 column width (set by the
even-longer 'POST /acme/profile/{id}/order/{ord_id}/finalize' key in
that block).

gofmt aligns map-literal columns per contiguous run between blank
lines / structural breaks, not file-globally. The Phase 4 entries
form their own run because they're separated from the Phase 2 block
by the '// Phase 4 — key rollover + revocation + ARI.' comment.

Fix: 'gofmt -w' on the file, which rewrote the 6 lines with the
correct (narrower) intra-block alignment. No semantic change — just
whitespace.

Confirmed: 'gofmt -l .' clean; 'go vet ./internal/api/router/' clean
(the test still passes after the formatting change).
2026-05-03 18:58:00 +00:00
shankar0123 4dc8d3fa5b acme-server: key rollover + revocation + ARI (Phase 4/7)
Closes the RFC 8555 + RFC 9773 surface beyond the issuance happy-path:
  - POST /acme/profile/<id>/key-change   (RFC 8555 §7.3.5)
  - POST /acme/profile/<id>/revoke-cert  (RFC 8555 §7.6)
  - GET  /acme/profile/<id>/renewal-info/<cert-id>  (RFC 9773 ARI)

After this commit, ACME clients can rotate account keys, revoke certs
through the ACME surface (rather than only via the certctl GUI/API),
and fetch ARI for proactive renewal scheduling.

Architecture:
  - Key rollover: outer JWS verified against the registered account key
    (existing kid path); the inner JWS — embedded as the outer's payload
    — verified against the embedded NEW jwk in a new dedicated routine
    (ParseAndVerifyKeyChangeInner) that enforces RFC 8555 §7.3.5
    inner-only invariants: MUST use jwk + MUST NOT use kid, payload
    .account == outer.kid, payload.oldKey thumbprint-equals registered.
    A single WithinTx swaps the stored thumbprint+pem and writes the
    audit row. Concurrent-rollover safety via SELECT…FOR UPDATE on the
    conflicting account row in UpdateAccountJWKWithTx; the loser
    observes the winner's new thumbprint and is told to retry (409).
  - Revocation: two auth paths. kid → AccountOwnsCertificate single-
    indexed COUNT lookup over acme_orders. jwk → constant-time RFC 7638
    thumbprint compare against the cert's pubkey. Both paths route
    through service.RevocationSvc.RevokeCertificateWithActor so the
    existing CRL/OCSP refresh + audit + metrics pipeline applies. RFC
    5280 §5.3.1 numeric reason codes clamp to certctl's
    domain.ValidRevocationReasons; codes 8 (removeFromCRL) + 10
    (aACompromise) clamp to 'unspecified' since they aren't in the set.
  - ARI is GET-only and unauth per RFC 9773 §4. Cert-id wire shape is
    base64url(AKI).base64url(serial); ParseARICertID strict-decodes,
    SerialHex emits the canonical certctl-shape lowercase-no-leading-
    zeros hex used in certificate_versions.serial_number.
    ComputeRenewalWindow has 3 branches: bound RenewalPolicy →
    [notAfter - days, notAfter - days/2]; no policy → last 33% of
    validity; past expiry → [now, now + 1d] (renew immediately).
    Retry-After honors CERTCTL_ACME_SERVER_ARI_POLL_INTERVAL.

What ships:
  - internal/api/acme/{keychange,ari}.go (+ phase4_test.go: 15 tests).
  - internal/api/acme/order.go: RevokeCertRequest wire shape.
  - internal/api/handler/acme.go: KeyChange, RevokeCert, RenewalInfo
    + 11 new writeServiceError mappings.
  - internal/repository/postgres/acme.go: UpdateAccountJWKWithTx (FOR
    UPDATE + expectedOldThumbprint precondition; ErrACMEAccountKey-
    ConcurrentUpdate sentinel) + AccountOwnsCertificate.
  - internal/service/acme.go: RotateAccountKey + RevokeCert +
    RenewalInfo; CertificateRevoker + RenewalPolicyLookup interfaces;
    SetRevocationDelegate + SetRenewalPolicyLookup wiring; 11 new
    sentinels; 6 new metrics.
  - internal/service/acme_phase4_test.go: service-layer tests for
    RotateAccountKey (happy + duplicate-key) + RevokeCert (kid mismatch
    + jwk mismatch + jwk happy + already-revoked + reason-clamping) +
    RenewalInfo (disabled + bad cert-id).
  - internal/api/router/router.go: 6 new register calls (3 per-profile
    + 3 shorthand). Router parity exceptions extended in lockstep
    (in-tree SpecParityExceptions + CI-only openapi-handler-exceptions
    .yaml).
  - cmd/server/main.go: SetRevocationDelegate(revocationSvc) +
    SetRenewalPolicyLookup(renewalPolicyRepo) at startup.
  - internal/config/config.go: CERTCTL_ACME_SERVER_ARI_ENABLED (default
    true) + CERTCTL_ACME_SERVER_ARI_POLL_INTERVAL (default 6h);
    BuildDirectory's ariEnabled flag now flips on under
    cfg.ARIEnabled.
  - docs/acme-server.md: phase status flipped to Phase 4; endpoints
    table grows 6 rows (3 per-profile + 3 shorthand); FAQ section
    appended explaining how to rotate keys, revoke certs, and consume
    ARI.

Tests:
  - 'go vet ./...' clean across the repo.
  - 'go test -short -count=1 ./...' green across every package.
  - phase4_test.go covers: keychange happy-path + 5 negatives +
    MapKeyChangeErrorToProblem coverage; ARI cert-id round-trip + 6
    malformed cases + BuildARICertID from a generated cert; window-
    math 3 branches.
  - service-layer tests confirm: RotateAccountKey atomically swaps the
    thumbprint (verifies persisted state) and rejects duplicate keys;
    RevokeCert routes through the stub RevocationSvc with the right
    actor string + reason on the jwk path, rejects mismatched keys,
    rejects already-revoked certs, clamps reason codes correctly;
    RenewalInfo respects ARIEnabled + cert-id format.

Engineering history: cowork/WORKSPACE-CHANGELOG.md 'ACME-Server-4'.
2026-05-03 16:51:06 +00:00
shankar0123 62513ad12f ci: fix Phase 3 post-push CI failures (contextcheck + ST1021)
CI on commit 9bc8453 (Phase 3 challenges) failed three lint checks under
golangci-lint. Two were contextcheck on internal/service/acme.go
RespondToChallenge, where the validator-pool dispatch deliberately
detached from the request ctx via 'context.Background()' so the async
WithinTx survives the HTTP handler returning. contextcheck rightly
flagged the non-inherited context — the canonical Go 1.21+ answer for
this exact pattern is context.WithoutCancel(ctx), which preserves
inherited values (logger, trace IDs, audit actor) but detaches
cancellation. Swapping that in clears both contextcheck hits.

The third was ST1021 on internal/api/acme/validators.go: a comment
intended for the (*Pool).Snapshot() method had landed above the
PoolSnapshot type by accident. Split the comment — one prose line
for the type, one for the method — so each exported symbol carries
its own properly-anchored doc.

Confirmed local 'go vet' clean and 'go test -short -count=1' green
across internal/service/ and internal/api/acme/ before commit.
2026-05-03 15:56:03 +00:00
shankar0123 9bc845304e acme-server: HTTP-01 + DNS-01 + TLS-ALPN-01 challenge validation (Phase 3/7)
Wires up the actual challenge-validation machinery so profiles in
acme_auth_mode='challenge' resolve end-to-end. After this commit,
cert-manager 1.15+ with `solver: http01: ingress` against a
challenge-mode profile completes a real HTTP-01 flow and gets a cert.
DNS-01 + TLS-ALPN-01 share the same code path with the appropriate
validator selection.

Architecture (the load-bearing parts):
  - 3 separate semaphore-bounded worker pools (one per challenge type),
    so HTTP-01 and DNS-01 can't starve each other under load. Default
    weight 10 per type; tunable via CERTCTL_ACME_SERVER_HTTP01_CONCURRENCY,
    DNS01_CONCURRENCY, TLSALPN01_CONCURRENCY.
  - 30s per-challenge timeout (configurable via PoolConfig.PerChallengeTimeout).
  - HTTP-01 validator runs validation.IsReservedIPForDial (newly
    exported wrapper preserving the existing private impl byte-for-byte
    for the network scanner + ValidateSafeURL paths) on the resolved
    IP — both at the initial dial and every redirect hop. SSRF probes
    into private IP space are refused before the connect.
  - DNS-01 validator uses a dedicated resolver pointed at
    CERTCTL_ACME_SERVER_DNS01_RESOLVER (default 8.8.8.8:53) — does
    NOT use the system resolver to keep behavior deterministic across
    deployments. Wildcard handling: `*.example.com` queries
    _acme-challenge.example.com.
  - TLS-ALPN-01 validator (RFC 8737) connects with ALPN `acme-tls/1`,
    inspects the id-pe-acmeIdentifier extension (OID 1.3.6.1.5.5.7.1.31),
    asserts the ASN.1 OCTET STRING value equals SHA-256 of the key
    authorization. Cert chain is intentionally NOT validated
    (InsecureSkipVerify=true is correct per RFC 8737 — the proof is
    in the extension, not the chain). Documented in docs/tls.md L-001
    table + the //nolint:gosec comment carries the justification.
    SSRF guard: same posture as HTTP-01.
  - Validation is asynchronous: handler accepts the POST and returns
    200 immediately with status=processing; the worker-pool fires a
    callback that updates challenge → authz → order in a fresh
    background-context WithinTx. The order auto-promotes to `ready`
    when ALL authzs become valid; auto-fails to `invalid` when ANY
    authz becomes invalid.

What ships:
  - internal/api/acme/challenge.go: KeyAuthorization (RFC 8555 §8.1) +
    DNS01TXTRecordValue (§8.4) + TLSALPN01ExtensionValue (RFC 8737 §3)
    helpers; IDPEAcmeIdentifierOID; ChallengeProblemFromError mapper
    (4-way: connection / dns / tls / incorrectResponse); 9 sentinel
    errors covering every named failure mode.
  - internal/api/acme/validators.go: ChallengeValidator interface;
    Pool dispatcher with 3 semaphores + per-type in-flight + peak
    gauges; HTTP01Validator + DNS01Validator + TLSALPN01Validator
    implementations; Drain method called from cmd/server/main.go's
    shutdown sequence.
  - internal/api/acme/validators_test.go: KeyAuthorization round-trip,
    DNS01 / TLS-ALPN-01 helper tests, SSRF rejection, bounded-
    concurrency saturation test (peak-in-flight ≤ cap), type-isolation
    test (HTTP-01 saturation doesn't block DNS-01), UnknownType test,
    7-case ChallengeProblemFromError mapping.
  - internal/repository/postgres/acme.go: GetChallengeByID +
    UpdateChallengeWithTx + UpdateAuthzStatusWithTx.
  - internal/service/acme.go: SetValidatorPool wires the *acme.Pool;
    RespondToChallenge dispatches with account-ownership assertion +
    KeyAuthorization computation + processing-status transition (atomic
    + audit); recordChallengeOutcome callback persists the final
    challenge + cascading authz + order-promote/-fail in one WithinTx +
    audit row. 4 new metrics.
  - internal/api/handler/acme.go: Challenge handler; round-trips
    account.JWKPEM through ParseJWKFromPEM to recover the *jose.JSONWebKey
    the validator pool needs.
  - internal/api/router/router.go + openapi_parity_test.go +
    api/openapi-handler-exceptions.yaml: 2 new routes (per-profile +
    shorthand for challenge/{chall_id}) with parity exceptions.
  - cmd/server/main.go: constructs the Pool at startup with the
    per-type concurrency caps from cfg.ACMEServer; ACMEService.ValidatorPool()
    accessor exposed for the shutdown drain sequence.
  - internal/validation/ssrf.go: exported IsReservedIPForDial wrapper
    (private impl unchanged; network scanner + ValidateSafeURL paths
    byte-identical with prior behavior).
  - docs/tls.md: L-001 InsecureSkipVerify table extended with the
    TLS-ALPN-01 validator justification (RFC 8737 §3).
  - docs/acme-server.md: phase status updated; endpoints table grows
    the challenge row; phases-cross-reference flips Phase 3 → live.

Tests:
  - 80%+ coverage on the new files.
  - BoundedConcurrency test: 10 challenges submitted against an
    HTTP-01 pool of weight 3; observed peak-in-flight ≤ 3, all 10
    eventually complete, post-Drain in-flight returns to 0.
  - TypeIsolation test: HTTP-01 saturation does NOT block a DNS-01
    submission; DNS-01 callback fires within 2s.
  - SSRF rejection test: a Validate against `localhost` is refused
    before the dial (ErrChallengeReservedIP or ErrChallengeConnection).

Engineering history: cowork/WORKSPACE-CHANGELOG.md "ACME-Server-3".
2026-05-03 14:09:00 +00:00
shankar0123 c351bba41a acme-server: orders + authorizations + finalize + cert download (Phase 2/7)
Closes the issuance loop in trust_authenticated mode (commits ec88a61
+ 44a85d6 wired the foundation + JWS-verified account resource).
After this commit, an ACME client running against a profile with
acme_auth_mode='trust_authenticated' end-to-end-issues a real cert:

  POST /acme/profile/<id>/new-order      → 201 + order URL (status=ready)
  POST /acme/profile/<id>/order/<oid>    → POST-as-GET fetch
  POST /acme/profile/<id>/order/<oid>/finalize  → 200 + status=valid + cert URL
  POST /acme/profile/<id>/cert/<cid>     → 200 + PEM chain

Profiles with acme_auth_mode='challenge' get the same code path with
authz/challenge rows in `pending` state until Phase 3's validators
wire up. The mode is read from the bound profile's column at request
time, NOT cached at server start — operators flipping the column via
SQL take effect on the next order without restart.

Architecture (the load-bearing part):
  - Finalize routes through service.CertificateService.Create — the
    canonical certctl issuance entry point that wraps the
    managed_certificates row insert + audit row in s.tx.WithinTx.
    RenewalPolicy / CertificateProfile / per-issuer-type Prometheus
    metrics / audit rows all apply uniformly to ACME-issued certs via
    the same code path that already serves EST/SCEP/agent/REST issuance.
  - Identifier validation runs BEFORE order creation. Rejected
    identifiers return RFC 7807 with per-identifier subproblems and
    create no order row.
  - Source stamp on managed_certificates: domain.CertificateSourceACME.
    Operators bulk-revoke ACME-issued certs by filtering on Source=ACME.
  - 3-step atomicity boundary documented in code + this commit msg:
    (A) WithinTx-A marks order processing + audit row.
    (B) IssuerConnector.IssueCertificate + CertificateService.Create
        (each in its own WithinTx — Create wraps cert row + audit
        atomically).
    (C) WithinTx-C creates certificate_versions row + transitions order
        to valid + sets certificate_id + audit row.
    The brief window between B and C can leave a managed_certificates
    row whose order is still in `processing`. Phase 5's GC scheduler
    reconciles. Documented inline.

What ships:
  - internal/api/acme/order.go: OrderResponseJSON + AuthorizationResponseJSON
    + ChallengeResponseJSON + NewOrderRequest + FinalizeRequest wire
    shapes; ValidateIdentifiers (Phase 2 syntactic checks, dns-only);
    CSRMatchesIdentifiers (RFC 8555 §7.4 strict equality, case-folded).
  - internal/domain/acme.go: ACMEOrder + ACMEAuthorization + ACMEChallenge
    + ACMEIdentifier + ACMEProblem domain types + closed status enums
    for each (order: pending|ready|processing|valid|invalid; authz:
    pending|valid|invalid|deactivated|expired|revoked; challenge:
    pending|processing|valid|invalid; challenge type: http-01|dns-01|
    tls-alpn-01).
  - internal/domain/profile.go: new ACMEAuthMode field reading from
    certificate_profiles.acme_auth_mode (added in migration 25).
  - internal/domain/certificate.go: new CertificateSourceACME enum value.
  - internal/repository/postgres/profile.go: extended SELECT/scanProfile
    to read the per-profile acme_auth_mode column with a COALESCE
    default of trust_authenticated.
  - internal/repository/postgres/acme.go: full order/authz/challenge
    CRUD (CreateOrderWithTx + GetOrderByID + UpdateOrderWithTx +
    CreateAuthzWithTx + GetAuthzByID + ListAuthzsByOrder +
    ListChallengesByAuthz + CreateChallengeWithTx) with proper
    sql.NullTime + JSONB handling. scanACMEOrder /
    scanACMEAuthz / scanACMEChallenge helpers.
  - internal/service/acme.go: extended ACMERepo interface; new
    SetIssuancePipeline wires certificateService + certificateRepo +
    issuerRegistry. CreateOrder (auth-mode-dispatched: trust_authenticated
    auto-marks order ready + authz valid + 1 placeholder http-01
    challenge valid; challenge mode keeps everything pending). LookupOrder
    (with account-ownership assertion). LookupAuthz. ListAuthzsByOrder.
    FinalizeOrder (3-step atomicity boundary as above; CSR-vs-order
    SAN strict-equality check before issuance; persists FinalizeOrderResult
    {Order, CertID}). LookupCertificate. randIDSuffix + base32encode
    helpers for the human-readable acme-ord-* / acme-authz-* /
    acme-chall-* prefixes (CLAUDE.md "TEXT primary keys with human-
    readable prefixes" architecture decision). 8 new per-op metrics.
  - internal/service/acme_test.go: extended fakeACMERepo with Phase 2
    interface stubs; new orderTrackingRepo for observable persistence;
    2 new tests asserting trust_authenticated → auto-ready/valid and
    challenge → stays-pending.
  - internal/api/handler/acme.go: NewOrder + Order + OrderFinalize +
    Authz + Cert handler methods. orderURL / authzURL / certURL /
    challengeURLBuilder helpers; marshalOrderForResponse fetches
    per-order authzs to populate the URL list. parseOptionalTime for
    notBefore / notAfter.
  - internal/api/handler/acme_handler_test.go: extended mockACMEService
    with Phase 2 method stubs; 4 new handler tests (NewOrder happy +
    rejected-identifier + OrderFinalize bad-CSR + Cert happy).
  - internal/api/router/router.go: 10 new Register calls (5 per-profile
    + 5 shorthand) for new-order, order/{ord_id}, order/{ord_id}/finalize,
    authz/{authz_id}, cert/{cert_id}.
  - internal/api/router/openapi_parity_test.go + api/openapi-handler-exceptions.yaml:
    10 new exception entries.
  - cmd/server/main.go: SetIssuancePipeline at startup, threading
    certificateService + certificateRepo + issuerRegistry into ACMEService.
  - docs/acme-server.md: phase status updated; endpoints table grows
    5 rows for new-order/order/finalize/authz/cert (per-profile +
    shorthand variants); new section "Finalize routing through
    CertificateService.Create" documenting the 3-step atomicity
    boundary + the actor-string convention `acme:<account-id>`.

Tests: ACME package + service + handler + router + config + domain
all green under -short. New cases:
  - TestCreateOrder_TrustAuthenticated_AutoReady (asserts auto-ready
    transition + valid-status authz/challenge + audit row + metric bump).
  - TestCreateOrder_ChallengeMode_StaysPending (asserts pending-status
    cascading authz/challenge for challenge mode).
  - TestACMEHandler_NewOrder_HappyPath (asserts 201 + Location +
    finalize URL shape).
  - TestACMEHandler_NewOrder_RejectedIdentifier (asserts 400 + RFC 7807
    rejectedIdentifier + per-identifier subproblems for type=ip).
  - TestACMEHandler_OrderFinalize_BadCSR (asserts 400 + badCSR for
    non-base64 CSR field).
  - TestACMEHandler_Cert_HappyPath (asserts 200 + PEM content-type +
    PEM chain in body).

Engineering history: cowork/WORKSPACE-CHANGELOG.md "ACME-Server-2".
2026-05-03 13:46:10 +00:00
shankar0123 44a85d6f85 acme-server: account resource + JWS verifier (Phase 1b/7)
Layers JWS-authenticated POST machinery onto the Phase 1a foundation
(commit ec88a61). After this commit, an ACME client can run

  POST /acme/profile/<id>/new-account

against certctl and successfully register an account. Account update
+ deactivation via POST /acme/profile/<id>/account/<acc-id> work.
Orders + challenges remain Phase 2 / 3.

Background:
  Two prior dispatch attempts at the original Phase 1 ("skeleton +
  directory + new-nonce + new-account" as a single commit) failed on
  go-jose v4 API speculation (jws.GetPayload, sig.Algorithm,
  jose.SHA256, etc. — none of those exist in v4). Splitting Phase 1
  into 1a (foundation, no go-jose) and 1b (this commit, all go-jose
  in one place) concentrated the JWS work where attention pays off.
  The verifier reads the actual go-jose v4 surface — ParseSigned with
  closed alg allow-list, Header struct fields (Algorithm, KeyID,
  JSONWebKey, Nonce, ExtraHeaders[HeaderKey]), JWK.Thumbprint with
  stdlib crypto.SHA256.

What ships:
  - internal/api/acme/jws.go: 487-line verifier + sentinel error
    family. Enforces RFC 8555 §6.2 + §6.4 + §6.5 invariants:
      - alg in {RS256, ES256, EdDSA} (closed allow-list passed to
        jose.ParseSigned — HS256 / none / etc. rejected at parse time)
      - exactly one of `kid` / `jwk` in protected header (per
        endpoint policy — new-account demands jwk, others demand kid)
      - protected `url` matches request URL exactly
      - protected `nonce` consumed against acme_nonces (badNonce on
        miss/replay/expiry per RFC 8555 §6.5.1)
      - kid round-trips against canonical AccountKID(accountID) URL
        (catches cross-profile / cross-host replay)
      - kid path: account exists + status=valid (deactivated /
        revoked accounts cannot authenticate)
      - signature verifies; post-Verify payload bytes equal
        UnsafePayloadWithoutVerification (defense in depth)
    + JWK persistence helpers (JWKToPEM / ParseJWKFromPEM round-
    trip a public-only JWK as a PEM-wrapped JSON envelope; stored
    as TEXT in acme_accounts.jwk_pem for diff-friendliness) +
    JWKThumbprint per RFC 7638.
  - internal/api/acme/jws_test.go: 16 cases covering happy paths
    (RS256 kid, ES256 jwk, EdDSA kid) + every named failure mode
    (alg-not-allowed, bad-sig, missing-nonce, unknown-nonce,
    replay, url-mismatch, mixed kid+jwk, deactivated-account,
    cross-host kid). Uses real keypairs + real go-jose Signer to
    build JWS objects.
  - internal/api/acme/account.go: NewAccountRequest /
    AccountUpdateRequest payload shapes (RFC 8555 §7.3 + §7.3.2 +
    §7.3.6) + AccountResponseJSON wire shape + MarshalAccount
    helper.
  - internal/domain/acme.go: ACMEAccount struct + ACMEAccountStatus
    closed enum (valid / deactivated / revoked).
  - internal/repository/postgres/acme.go: full account CRUD path
    (CreateAccountWithTx with 23505-unique-violation sentinel
    translation, GetAccountByID, GetAccountByThumbprint,
    UpdateAccountContactWithTx, UpdateAccountStatusWithTx) +
    sql.ErrNoRows-wrapped repository.ErrNotFound on lookup misses.
  - internal/service/acme.go: ACMERepo interface extended;
    SetTransactor + SetAuditService wires; NewAccount (idempotent
    re-registration per RFC 8555 §7.3.1 — same JWK returns existing
    row without an update or new audit event); LookupAccount;
    UpdateAccount; DeactivateAccount; VerifyJWS adapter that bridges
    api/acme.VerifierConfig to the service-layer ACMERepo; per-op
    metrics extended (new_account_total + _failures_total +
    _idempotent_total + update_account_total + _failures_total +
    deactivate_account_total).
  - internal/service/acme_test.go: 8 new tests covering
    new-account happy path / idempotent re-registration / only-
    return-existing match + no-match / contact update / deactivate
    / lookup-not-found / requires-transactor.
  - internal/api/handler/acme.go: NewAccount + Account handlers.
    Account dispatches POST-as-GET (RFC 8555 §6.3 — empty body or
    {} payload returns the account row), contact update, and
    deactivation from the same endpoint. Defense-in-depth check
    that the kid path-segment matches the URL path-segment (the
    verifier already round-tripped the kid against canonical URL,
    but the handler re-asserts to catch any future verifier
    refactor).
  - internal/api/handler/acme_handler_test.go: 7 new cases
    covering happy-create, idempotent-200, only-return-existing-
    no-match-400, malformed-JWS-400, kid-URL-mismatch-401,
    deactivate, contact-update, POST-as-GET.
  - internal/api/router/router.go: 4 new Register calls (per-
    profile + shorthand for new-account and account/{acc_id}).
  - internal/api/router/openapi_parity_test.go: SpecParityExceptions
    extended with the 4 new routes (RFC 8555 wire-protocol surface,
    not OpenAPI-shaped — same precedent as Phase 1a).
  - cmd/server/main.go: SetTransactor + SetAuditService on
    acmeService at startup so the WithinTx-based new-account /
    update / deactivate paths run with the same transactor instance
    shared across CertificateService / RevocationSvc / RenewalService.
  - docs/acme-server.md: Phase status updated; endpoints table grows
    new-account + account/<acc_id> rows; new "JWS verification
    (Phase 1b)" section enumerates the 7 invariants the verifier
    enforces; phases-cross-reference table marks 1b live.
  - go.mod / go.sum: github.com/go-jose/go-jose/v4 v4.0.4 added.

Atomicity: every account-state mutation writes its acme_accounts row
+ its audit_events row inside one repository.Transactor.WithinTx
call — the canonical certctl atomicity contract (matches
CertificateService.Create at internal/service/certificate.go:131).
Idempotent re-registration explicitly does NOT write an audit row
(RFC 8555 §7.3.1 returns the existing row unmodified).

Tests: 16 jws_test.go cases + 11 service tests + 11 handler tests
all pass under -short. Bad-signature test uses a real registered
account whose stored JWK is a different keypair from the signer's,
so the JWS parses cleanly but jose.Verify rejects — exercises the
ErrJWSSignatureInvalid path directly.

Engineering history: cowork/WORKSPACE-CHANGELOG.md "ACME-Server-1b".
2026-05-03 13:21:56 +00:00
shankar0123 ec88a61274 acme-server: foundation — directory + new-nonce + per-profile routing (Phase 1a/7)
First slice of the RFC 8555 ACME server endpoint (master plan at
cowork/acme-server-endpoint-prompt.md, per-phase prompts at
cowork/acme-server-prompts/). This commit lands the smallest viable
end-to-end deployable slice: an ACME client running

  curl -sk https://certctl/acme/profile/<id>/directory
  curl -sk -I https://certctl/acme/profile/<id>/new-nonce

successfully fetches the directory document and a Replay-Nonce.
Account creation, JWS verification, orders, challenges, and
revocation are all out of scope for this phase and arrive in Phases
1b–4.

Closes the Rank 1 LHF from the 2026-05-03 Infisical deep-research
(cowork/infisical-deep-research-results.md). Pre-fix, certctl was an
ACME consumer only — no /acme/directory endpoint, no JWS verifier,
no challenge validators. K8s customers running cert-manager could
not point at certctl as an ACME issuer; they had to deploy a certctl
agent on every node.

What ships:
  - internal/api/acme/{directory,nonce,errors}.go (+ tests).
  - internal/api/handler/acme.go + acme_handler_test.go.
  - internal/repository/postgres/acme.go (nonce ops only — Phase 1b
    extends with account CRUD; Phases 2-4 extend with order / authz /
    challenge CRUD).
  - internal/service/acme.go (BuildDirectory + IssueNonce stubs;
    Phase 1b adds VerifyJWS / NewAccount / etc.).
  - migrations/000025_acme_server.{up,down}.sql ships the full 5-table
    ACME schema (acme_accounts / acme_orders / acme_authorizations /
    acme_challenges / acme_nonces) PLUS the per-profile
    certificate_profiles.acme_auth_mode column. Phase 1a actively
    uses only acme_nonces; remaining tables are empty until Phases
    1b-4 plug in.
  - internal/config/config.go: ACMEServerConfig struct + ACMEServer
    field on Config. Env vars use CERTCTL_ACME_SERVER_* prefix to
    avoid colliding with the existing consumer-side ACMEConfig at
    config.go:1746 (CERTCTL_ACME_DIRECTORY_URL / PROFILE /
    CHALLENGE_TYPE etc.). Phase 1a wires Enabled +
    DefaultAuthMode + DefaultProfileID + NonceTTL + DirectoryMeta;
    Order/Authz TTLs + per-challenge-type concurrency caps + DNS01
    resolver are reserved fields parsed in 1a so operators can set
    them ahead of Phases 2/3.
  - cmd/server/main.go: wire ACMEHandler into the HandlerRegistry
    literal alongside the existing certificate / EST / SCEP / etc.
    handlers.
  - internal/api/router/router.go: HandlerRegistry.ACME field + 6
    Register calls (3 per-profile + 3 shorthand).
  - internal/api/router/openapi_parity_test.go: 6 new entries in
    SpecParityExceptions. ACME is a wire-protocol surface (JWS-signed
    JSON over HTTPS per RFC 7515) whose semantics are dictated by
    RFC 8555 + RFC 9773 rather than by an OpenAPI document, same
    precedent as SCEP/EST. The canonical reference is
    docs/acme-server.md.
  - docs/acme-server.md: Phase-1a-shaped reference. Configuration
    table for every CERTCTL_ACME_SERVER_* env var. Per-profile
    auth-mode decision tree skeleton. TLS trust bootstrap section
    flagging cert-manager's ClusterIssuer.spec.acme.caBundle
    requirement (the single biggest first-time-deploy footgun;
    the full cert-manager walkthrough lands in Phase 6 but the
    requirement is documented up front).

Architecture decisions baked in:
  - URL family is /acme/profile/<id>/* (per-profile, canonical) with
    /acme/* shorthand active when CERTCTL_ACME_SERVER_DEFAULT_PROFILE_ID
    is set. Path matches existing per-profile precedent in EST + SCEP.
  - Auth mode is per-profile (acme_auth_mode column on
    certificate_profiles), NOT server-wide. One certctl-server can
    serve trust_authenticated for an internal-PKI profile and
    challenge for a public-trust-style profile simultaneously. The
    column is read at request time, not cached at server start —
    operators flipping a profile's mode via SQL take effect on the
    next order without restart.
  - Nonces are DB-backed (acme_nonces table). Survive server restart.
    The RFC 8555 §6.5 replay defense requires the store to outlast
    the client's nonce caching window; an in-memory-only nonce
    store would lose every in-flight order on restart.
  - Per-op atomic counters on service.ACMEService.Metrics() —
    certctl_acme_directory_total, certctl_acme_directory_failures_total,
    certctl_acme_new_nonce_total, certctl_acme_new_nonce_failures_total.
    Naming follows certctl frozen decision 0.10 cardinality discipline.
    Phase 1b will extend with new_account counters; Phase 2 with
    order / finalize / cert; Phase 3 with per-challenge-type counters.

Audit fixes #11 + #12 (cowork/acme-server-prompts/audit-additions.md)
applied:
  - #11: CERTCTL_ACME_SERVER_* prefix avoids the consumer-side
    CERTCTL_ACME_* namespace collision.
  - #12: prior-attempt WIP from two failed Phase-1 dispatches was
    discarded at phase start; this commit starts from a clean tree.

Tests:
  - 14 unit tests in internal/api/acme/ (directory, nonce, errors).
  - 7 handler-level tests via httptest.NewServer + mockACMEService
    (mirrors the mockSCEPService pattern at scep_handler_test.go).
  - 7 service-layer tests with mocked repo + injected profileLookup.
  - All pass under -race -count=1 -short.

Deferred to Phase 1b:
  - JWS verification (go-jose v4 — see master-prompt §8a for the API
    surface and audit doc for the speculation pitfalls).
  - new-account / account/<id> endpoints + AccountService.
  - Nonce *consumption* path (issue path is in this commit; consume
    is only invoked by JWS-verified POSTs which Phase 1b adds).

Engineering history: cowork/WORKSPACE-CHANGELOG.md "ACME-Server-1a".
Per-phase implementation plan: cowork/acme-server-prompts/.
Master plan + audit fixes: cowork/acme-server-endpoint-prompt.md +
cowork/acme-server-prompt-audit.md +
cowork/acme-server-prompts/audit-additions.md.
2026-05-03 12:55:40 +00:00
shankar0123 3b92048242 metrics: add per-issuer-type issuance counters, histogram, and failure classifier
Closes the #4 acquisition-readiness blocker from the 2026-05-01 issuer
coverage audit. Before this commit, certctl's Prometheus exposition had
zero per-issuer-type signal — operators answering "is DigiCert slow?"
or "is Sectigo failing more than ACME?" had to grep logs by issuer
name. This commit adds three series labelled by issuer type:

  certctl_issuance_total{issuer_type, outcome}
  certctl_issuance_duration_seconds{issuer_type}            (histogram)
  certctl_issuance_failures_total{issuer_type, error_class}

The histogram covers 0.05–120 second buckets to span the local-issuer
fast path and async-CA slow path (DigiCert/Sectigo/Entrust polling
can take minutes). error_class is a closed enum of eight values
(timeout, auth, rate_limited, validation, upstream_5xx, upstream_4xx,
network, other) classified once in service.ClassifyError. Cardinality
budget is ~276 new series, well within Prometheus's comfortable range.

Implementation:
- service.IssuanceMetrics is the thread-safe counter + histogram
  table. Three independent views (counters / failures / durations)
  exposed via SnapshotCounters / SnapshotFailures / SnapshotDurations.
  sync.RWMutex protects the map shape; per-key sync/atomic.Uint64
  primitives keep the recording hot path lock-free under concurrent
  service-layer goroutines.
- service.IssuanceCounterEntry / IssuanceFailureEntry /
  IssuanceDurationEntry / IssuanceMetricsSnapshotter live in service
  (not handler) to avoid an import cycle: handler already imports
  service for admin_est.go etc., so service can't import handler back.
  Handler's exposer takes the snapshotter via the service-defined
  interface.
- service.ClassifyError pure function maps error → error_class.
  context.DeadlineExceeded / context.Canceled → timeout; *net.OpError
  → network; substring matches against canonical AWS / DigiCert /
  Sectigo error shapes for auth / rate_limited / validation /
  upstream_5xx / upstream_4xx / network; unknown → other. Each branch
  has at least one representative test case in
  TestClassifyError.
- IssuerConnectorAdapter.SetMetrics wires per-adapter recording
  (issuerType + metrics). Existing 28+ test call sites of
  NewIssuerConnectorAdapter keep their one-arg signature; production
  wiring goes through SetMetrics post-construction.
- IssuerRegistry.SetIssuanceMetrics + Rebuild type-asserts to
  *IssuerConnectorAdapter and calls SetMetrics with the issuer type
  string. nil-guarded — tests that hand-build adapters without
  metrics get no-op recording.
- IssuerConnectorAdapter.IssueCertificate / RenewCertificate wrap the
  underlying connector call with start := time.Now() and
  recordIssuance(start, err). Renewal is recorded into the same
  certctl_issuance_* series as initial issuance — operationally,
  renewal IS issuance from the connector's perspective (matches the
  audit prompt's guidance on series naming).
- handler/metrics.go GetPrometheusMetrics gains a new exposer block
  emitting all three series in stable label order with correct
  Prometheus format (_bucket / _sum / _count for the histogram, +Inf
  bucket appended). Sorted via sort.Slice for stable output. nil-
  guarded so deploys without the wire produce clean exposition.
- formatLE helper trims trailing zeros from histogram bucket labels
  via strconv.FormatFloat(le, 'f', -1, 64) so the `le` labels match
  Prometheus client conventions ("0.05", "30", "120", not "0.0500"
  etc.).
- cmd/server/main.go wires a single IssuanceMetrics instance into
  both the IssuerRegistry (recording) and the MetricsHandler (exposer)
  using DefaultIssuanceBucketBoundaries.

Tests:
- TestIssuanceMetrics_RecordAndSnapshot — happy-path counter +
  histogram + failure recording, BucketBoundaries returns a copy
  (not shared storage).
- TestIssuanceMetrics_HistogramCumulative — pins the cumulative-buckets
  contract. 100ms observation lands in 0.1 bucket and every larger
  bucket; 750ms only in the 1.0 bucket. Off-by-one here would
  corrupt every quantile query downstream.
- TestIssuanceMetrics_Concurrency — 100 goroutines × 1000 ops under
  the race detector. Asserts atomic counter integrity across
  contended writes.
- TestClassifyError — 17 cases covering every branch of the closed
  enum plus the nil-error special case.

Implementation chooses the existing hand-rolled fmt.Fprintf
exposition pattern (no prometheus/client_golang dependency added)
to stay consistent with the OCSP / deploy counter blocks already in
the file.

Out of scope (separate follow-ups):
- Revocation metrics (certctl_revocation_*) — symmetric to issuance
  but the audit didn't ask; explicit follow-up commit.
- Discovery / health-check duration histograms.
- prometheus/client_golang migration.

Verified locally:
- gofmt clean
- go vet ./... clean
- staticcheck ./... clean
- golangci-lint run --timeout 5m ./... → 0 issues
- go test -short -count=1 ./internal/service/ green
- go test -short -count=1 -race -run TestIssuanceMetrics ./internal/service/ green
- go test -short -count=1 ./internal/api/handler/ green
- go build ./... success

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #4 (Part 3, narrative section).
2026-05-02 00:39:25 +00:00
shankar0123 b0efdbe2f8 repo,service: introduce WithinTx and atomic audit rows for issue/renew/revoke
Closes the #3 acquisition-readiness blocker from the 2026-05-01 issuer
coverage audit (Part 1.5 finding #1: audit row not transactional with
issuance). AuditRepository.Create previously ran on the package-level
*sql.DB while the certificate insert / version insert / revocation
insert ran on independent connections — a failed audit INSERT after
a successful operation INSERT was silently lost. SOX §404 over IT
general controls, PCI-DSS §10 audit logging, HIPAA §164.312(b) audit
controls, and CA/B Forum Baseline Requirements §5.4.1 audit log
records all presume audit-with-operation atomicity.

Design — Option A (Querier abstraction). The chosen pattern: a shared
repository.Querier interface (subset of *sql.DB and *sql.Tx) plus a
postgres.WithinTx helper that begins a tx, runs fn, commits on nil
error, rolls back on error or panic, and returns the wrapped result.
Repository methods that participate in a service-layer transaction
expose a *WithTx variant taking repository.Querier; the bare methods
remain for stand-alone use. A repository.Transactor abstracts the
"begin tx, run fn, commit/rollback" lifecycle so service-layer code
runs multi-write operations atomically without holding *sql.DB
directly. Option B (UnitOfWork) was considered but adds boilerplate
without behavioral benefit for the current scope. Option C
(context-carried tx) was explicitly rejected — it hides the
transactional boundary from the type system, reproducing the class
of bug we're fixing.

This commit:
- Adds internal/repository/querier.go with the Querier interface
  (compile-time guards that *sql.DB and *sql.Tx satisfy it) and the
  Transactor interface for service-layer use.
- Adds internal/repository/postgres/tx.go with the WithinTx helper
  (begin/fn/commit/rollback with panic recovery) and a transactor
  type that satisfies repository.Transactor.
- Adds CreateWithTx variants on AuditRepository, CertificateRepository
  (Create + Update + CreateVersion), and RevocationRepository.
  Existing bare methods now delegate to the *WithTx variant using
  the package-level *sql.DB so existing call sites are
  behavior-preserving.
- Updates repository/interfaces.go: AuditRepository, CertificateRepository,
  and RevocationRepository declare the new *WithTx methods. Adds an
  atomicity contract doc-comment on AuditRepository pointing at
  WithinTx + the audit blocker.
- Adds AuditService.RecordEventWithTx, mirroring RecordEvent but
  routing through CreateWithTx so the audit row is part of the
  caller's transaction. Same redaction + marshalling contract.
- Refactors three audit-emitting service paths to use Transactor.WithinTx
  when SetTransactor was wired, with a legacy fallback for backward
  compat:
    * CertificateService.Create — cert insert + audit row in one tx.
    * RevocationSvc.RevokeCertificateWithActor — cert status update +
      revocation row + audit row in one tx. The OCSP cache invalidate
      remains best-effort (out of scope per the prompt).
    * RenewalService CompleteServerRenewal — cert version insert +
      cert update + audit row in one tx. Job status update stays
      outside the audit-atomicity scope (job state lives outside
      the operator-facing audit trail).
- Adds SetTransactor on CertificateService, RevocationSvc, and
  RenewalService. cmd/server/main.go wires a single Transactor
  instance shared across all three so all audit-emitting paths run
  their writes in transactions backed by the same *sql.DB handle.
- Updates 5 mock implementations to satisfy the new interface methods:
  mockCertRepo (testutil_test.go), mockCertRepoWithGetError
  (shortlived_test.go), fakeRevocationRepo (crl_cache_test.go),
  intuneE2EAuditRepo (scep_intune_e2e_test.go), and the integration-
  test mocks (lifecycle_test.go: mockCertificateRepository,
  mockAuditRepository, mockRevocationRepository). All *WithTx mocks
  ignore the Querier and delegate to the bare method (mocks have no
  DB; in-memory state is shared regardless of "tx").
- Adds a service-layer test mockTransactor with BeginTxErr and
  CommitErr knobs so the atomic-audit tests can assert error
  propagation through the transactional boundary.
- Adds internal/repository/postgres/tx_test.go: unit-level test that
  WithinTx surfaces "begin tx" wrap when BeginTx fails, and that
  Transactor.WithinTx delegates correctly. Real-Postgres rollback
  semantics are covered by the testcontainers tests in the postgres
  package — sandbox disk pressure prevented adding a sqlmock dep
  for the in-fn / commit-failure unit test, so those scenarios are
  exercised through atomic_audit_test.go using the mockTransactor's
  CommitErr / BeginTxErr fields.
- Adds internal/service/atomic_audit_test.go:
    * TestCertificateService_Create_AtomicWithTx — asserts audit
      insert failure inside the tx surfaces as the operation's error
      (closes the blocker contract).
    * TestCertificateService_Create_LegacyPathLogs — pins the
      backward-compat behavior when SetTransactor isn't wired:
      audit failure is logged-not-failed, matching pre-fix.
    * TestCertificateService_Create_TransactorBeginFailure — BeginTx
      error path: operation fails, no cert insert, no audit insert.
    * TestCertificateService_Create_TransactorCommitFailure —
      Commit error after successful in-fn writes surfaces as the
      operation's error. Real Postgres can fail Commit on
      serialization conflicts; the service must report this.

Out of scope (separate follow-up commits, same shape):
- Issuer CRUD audit atomicity.
- Target CRUD audit atomicity.
- Agent retire (already transactional via RetireAgentWithCascade;
  verified, not changed).
- Renewal-policy CRUD audit atomicity.
- Owner/team/agent-group CRUD audit atomicity.
- Discovery / health-check audit atomicity.

Verified locally:
- gofmt -l . clean
- go vet ./... clean
- staticcheck ./... clean
- golangci-lint run --timeout 5m ./... → 0 issues
- go test -short -count=1 ./internal/service/ green
- go test -short -count=1 ./internal/api/handler/ green
- go test -short -count=1 ./internal/integration/ green
- go test -short -count=1 ./internal/repository/postgres/ green
- go build ./... success

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #3 (Part 3, narrative section).
2026-05-02 00:29:09 +00:00
shankar0123 7cb453a336 chore(fmt): repo-wide gofmt -w sweep — close drift surfaced by ci-pipeline-cleanup Phase 4
Mechanical reformat. The new 'gofmt drift' CI step (added in
ci-pipeline-cleanup Phase 4, commit 0f205a8) surfaced 111 files
with accumulated gofmt drift across cmd/, internal/, and deploy/test/.

Each file's diff is gofmt-standard: whitespace adjustments, intra-
group import sorting (alphabetical by import path within blank-line-
separated groups), and struct-tag column alignment. No semantic
changes — verified via 'git diff --ignore-all-space' which shows only
the line-position deltas from import reordering.

The gate stays in place after this commit. Going forward it catches
gofmt drift at PR time.
2026-04-30 22:33:57 +00:00
shankar0123 8637131f80 chore: gofmt fixes across deploy-hardening I new files
Phase 13 verification surfaced gofmt-formatting drift in 6 files
across the bundle's new code:

- internal/api/handler/metrics.go (struct field alignment)
- internal/connector/target/k8ssecret/validate_only_test.go (alignment)
- internal/connector/target/nginx/nginx.go (alignment)
- internal/connector/target/postfix/postfix.go (alignment)
- internal/connector/target/ssh/validate_only_test.go (alignment)
- internal/service/deploy_counters.go (alignment)

Pure mechanical gofmt -w fixes; no behavior changes. CI's
make verify gate (which runs `go fmt ./...`) didn't catch these
because go fmt is more lenient than gofmt -l, but golangci-lint
v2.11.4 + the explicit gofmt step in Phase 13 verification did.

Phase 13 full-matrix verification all green:
- gofmt -l: empty across all bundle-touched files
- go vet ./internal/deploy/... ./internal/connector/target/... ./internal/service/ ./internal/api/handler/ ./cmd/agent/: clean
- golangci-lint v2.11.4 (the version CI runs): 0 issues
- go test -race -count=1 across deploy + nginx + apache + haproxy + agent + service: all green
- INTEGRATION=1 go test -tags integration -run Deploy ./deploy/test/...: 4/4 e2e tests green

Phase 14 next: release prep — Active Focus update, release notes,
Reddit-beat draft, final tag handoff to operator.
2026-04-30 15:33:33 +00:00
shankar0123 135b271197 feat(metrics): per-target-type deploy counters wired into /metrics/prometheus
Phase 10 of the deploy-hardening I master bundle. Mirrors the
production-hardening-II Phase 8 OCSP-counter pattern. Per frozen
decision 0.9, the metric naming convention is
`certctl_deploy_<area>_total` with target_type + sub-label.

internal/service/deploy_counters.go:
- DeployCounters struct with sync.Map of per-target-type buckets
  (apache, nginx, etc.). Lock-free fast path via sync/atomic
  Uint64 counters; LoadOrStore on first tick.
- 8 sub-counters per target-type bucket:
  - attemptsSuccess / attemptsFailure
  - validateFailures (PreCommit returned error)
  - reloadFailures (PostCommit returned error → rollback ran)
  - postVerifyFails (post-deploy TLS handshake failed)
  - rollbackRestored (rollback succeeded)
  - rollbackAlsoFail (operator-actionable escalation)
  - idempotentSkips (SHA-256 match → no-op deploy)
- Snapshot returns []DeploySnapshot for the Prometheus exposer.

internal/service/deploy_counters_test.go:
- 5 tests: zero-state, per-target-type tick isolation, race-detector
  smoke under concurrent ticks, cross-target bucket isolation,
  snapshot-mutation-doesn't-affect-counter.

internal/api/handler/metrics.go:
- New DeployCounterSnapshotter interface (mirrors CounterSnapshotter
  for the OCSP counters but uses the per-target-type tuple shape).
- New DeploySnapshotEntry struct copying the service-layer shape;
  avoids importing the service package directly so the handler
  stays dependency-light.
- New SetDeployCounters setter on MetricsHandler (mirrors
  SetOCSPCounters wiring).
- Prometheus exposer extended with 6 new metric blocks per frozen
  decision 0.9:
  - certctl_deploy_attempts_total{target_type, result}
  - certctl_deploy_validate_failures_total{target_type}
  - certctl_deploy_reload_failures_total{target_type}
  - certctl_deploy_post_verify_failures_total{target_type}
  - certctl_deploy_rollback_total{target_type, outcome}
  - certctl_deploy_idempotent_skip_total{target_type}
- Output sorted by target_type for stable diffs across requests.

The agent-side wire-up (cmd/agent/main.go ticking counters in the
DeployCertificate dispatch site) is intentionally deferred to a
follow-up commit — Phase 10's load-bearing change is the
infrastructure; per-connector tick wiring is a mechanical follow-on.

Build + go vet clean. go test -count=1 green for service +
handler packages.

Phase 11 next: cross-cutting integration tests at deploy/test/.
2026-04-30 15:25:38 +00:00
shankar0123 2d83342bbe feat(metrics): extend /metrics/prometheus with per-area OCSP counters (Phase 8)
Production hardening II Phase 8 — surface the OCSP per-event counters
shipped in Phase 1+2 through the existing /api/v1/metrics/prometheus
endpoint. Operators now alert on certctl_ocsp_counter_total
{label="rate_limited"} (Phase 3 trip), {label="nonce_malformed"}
(Phase 1 reject), {label="signing_failed"} (issuer connector fails),
etc.

NEW interface CounterSnapshotter (handler/metrics.go) — minimum
surface the Prometheus exposer needs from any per-area counter table:
just Snapshot() map[string]uint64. service.OCSPCounters.Snapshot
(Phase 1) satisfies it; future per-area counters (CRL, cert-export,
EST per-profile, SCEP per-profile, Intune per-profile) plug in the
same way as separate SetXxxCounters setters.

Naming convention per frozen decision 0.10:
  certctl_<area>_counter_total{label="<event>"} <value>

This commit ships only the OCSP block. The remaining areas (CRL,
cert-export, EST, SCEP, Intune) plug in via the same
SetXxxCounters pattern in follow-up commits — the wire-up cost per
area is one new field + one setter + one block of fmt.Fprintf lines.
The bundle's S-1 docs-count guard means we don't claim a specific
total in prose; operators run `curl /api/v1/metrics/prometheus | grep
certctl_` to enumerate.

Wired in cmd/server/main.go: a single shared *service.OCSPCounters
instance is created once and passed to BOTH the
ocspResponseCacheService (so the cache hot path ticks counters) AND
metricsHandler.SetOCSPCounters (so the Prometheus exposer reads
them). Existing dashboard metrics (certctl_certificate_total,
certctl_agent_total, etc.) remain unchanged at the same line offsets
— back-compat preserved.

Pre-commit verification: go build ./... clean; go test -short
-count=1 green for handler/ + service/. The existing
TestGetPrometheusMetrics_Success tests still pass (the new counter
block is additive at the END of the response body, after the
existing dashboard metrics + uptime line).
2026-04-30 05:15:05 +00:00
shankar0123 db854ecc6f feat(crl): HTTP caching headers (ETag + If-None-Match 304) per RFC 7232 (Phase 4)
Production hardening II Phase 4 — wire RFC 7232 conditional-request
support into GetDERCRL so CDNs and reverse proxies in front of certctl
can serve repeated CRL fetches from edge caches. Saves bandwidth +
removes the per-request DB read on the certctl side when a relying
party honors max-age.

ETag: weak form (W/) per RFC 7232 §2.3 wrapping the first 16 bytes
of SHA-256(DER) — sufficient ID space for the cache layer + leaves
headroom for a future builder that might emit signature randomness
that doesn't change the CRL semantics.

If-None-Match: when the inbound header matches the computed ETag,
short-circuit to 304 Not Modified with no body. Identical inbound
ETag → identical CRL → no need to retransmit the bytes.

Cache-Control: public, max-age=3600, must-revalidate. The 1h max-age
matches the default CRL regen cadence; relying parties that cache
won't re-fetch within the window. must-revalidate forces revalidation
once the window expires (so a stale relying party doesn't keep
returning expired-cache CRLs after the regen tick).

The pre-existing Cache-Control: max-age=3600 is preserved
syntactically (the new line replaces it with the more complete form);
existing relying parties see the same ceiling, just with the addition
of public + must-revalidate hints for downstream caches.

Pre-commit verification: go build ./... clean; go test -short
-count=1 green for handler/. The existing TestGetDERCRL_* tests still
pass — the new headers are additive, the response body is unchanged.
2026-04-30 05:09:28 +00:00
shankar0123 ed19312df6 feat(ratelimit): per-endpoint rate limit on OCSP + cert-export (Phase 3)
Production hardening II Phase 3 — wire the existing
internal/ratelimit/SlidingWindowLimiter into the OCSP and cert-export
handlers. Removes the DoS vector where an unauthenticated relying
party (or compromised admin token) can hammer the responder /
key-export endpoint at unbounded rates.

OCSP: per-source-IP cap. Default 1000 req/min/IP, 50k tracked IPs
(matches the SCEP/Intune replay cache cap). Configurable via
CERTCTL_OCSP_RATE_LIMIT_PER_IP_MIN; zero disables. Source IP comes
from net.SplitHostPort(r.RemoteAddr) — we deliberately do NOT honor
X-Forwarded-For because OCSP is publicly reachable and untrusted
intermediaries could spoof the header to bypass the limit.

On rate-limit trip: respond with the canonical
ocsp.UnauthorizedErrorResponse pre-built blob from x/crypto/ocsp
(status 6 per RFC 6960 §2.3) plus Retry-After: 60. Using the
unauthorized status (instead of TryLater) avoids hand-rolling DER
for a single rejection path; relying parties retry on any non-good
status anyway.

Cert-export: per-actor cap. Default 50 exports/hr/operator.
Configurable via CERTCTL_CERT_EXPORT_RATE_LIMIT_PER_ACTOR_HR; zero
disables. Actor extracted from the X-Actor request header (set by
the auth middleware); falls back to RemoteAddr if empty (defensive).

On rate-limit trip: HTTP 429 + JSON body
{"error":"rate_limit_exceeded","retry_after_seconds":3600} +
Retry-After: 3600.

NEW config fields in internal/config/config.go::SchedulerConfig:
  OCSPRateLimitPerIPMin (default 1000)
  CertExportRateLimitPerActorHr (default 50)

WIRED in cmd/server/main.go: ocspLimiter constructed with the
configured cap, 1m window, 50k map cap; exportLimiter same shape with
1h window. Both wired via SetOCSPRateLimiter / SetExportRateLimiter
on their respective handlers. Existing deploys see no behavior
change unless the env vars are set to non-default values + traffic
exceeds the cap.

Pre-commit verification: go build ./... clean; go test -short
-count=1 green for handler + service + config.
2026-04-30 05:08:04 +00:00
shankar0123 3d15a3e5af feat(ocsp): RFC 6960 §4.4.1 nonce extension support — echo client nonce in response, reject malformed
Production hardening II Phase 1.

The OCSP responder previously ignored the request's nonce extension
entirely, leaving relying parties vulnerable to replay attacks. RFC
6960 §4.4.1 defines the OPTIONAL id-pkix-ocsp-nonce extension (OID
1.3.6.1.5.5.7.48.1.2): when present in the request, the responder
MUST echo the same value in the response; when absent, no nonce in
the response (back-compat with relying parties that don't send one).

NEW internal/service/ocsp_nonce.go: ParseOCSPRequestNonce walks raw
DER (golang.org/x/crypto/ocsp.Request doesn't expose the request's
extensions field — the library only exposes IssuerNameHash +
IssuerKeyHash + SerialNumber). Returns one of three states:
  - (nil, false, nil) — no nonce extension in request
  - (nonce, true, nil) — well-formed nonce, ≤ MaxOCSPNonceLength (32)
  - (nil, false, ErrOCSPNonceMalformed) — empty or oversized

NEW internal/service/ocsp_counters.go: sync/atomic counter table for
OCSP request lifecycle (request_get/post, request_success/invalid,
nonce_echoed, nonce_malformed, rate_limited, ...). Mirrors the EST/
SCEP counter pattern; Phase 8 wires these into /metrics/prometheus.

CertSrv types extended:
  - internal/connector/issuer/interface.go::OCSPSignRequest gains
    Nonce []byte field.
  - internal/service/renewal.go::OCSPSignRequest (the service-layer
    duplicate used by ca_operations.go) gains the same field.
  - internal/service/issuer_adapter.go bridges the two.

Service path: CAOperationsSvc.GetOCSPResponseWithNonce(ctx, issuerID,
serialHex, nonce) is the new entry point that plumbs the nonce
through every signing site (good / revoked / unknown / short-lived).
The legacy GetOCSPResponse becomes a nil-nonce wrapper for back-
compat — every existing caller (tests, the GET handler) sees no
behavior change.

CertificateService gains the same WithNonce variant; the handler
interface adds it to the contract. MockCertificateService in tests
extended with the new method (delegates to the legacy fn when no
override is set, so existing tests that don't care about the nonce
keep working).

Local issuer's SignOCSPResponse appends the id-pkix-ocsp-nonce
extension (non-Critical per RFC 6960 §4.4) to the response template's
ExtraExtensions when req.Nonce != nil. The extnValue is the nonce
bytes wrapped in an OCTET STRING per RFC 6960 §4.4.1.

POST OCSP handler (HandleOCSPPost):
  - After ocsp.ParseRequest succeeds, calls ParseOCSPRequestNonce on
    the raw body to extract the optional nonce.
  - On ErrOCSPNonceMalformed (empty or > 32 bytes): writes an
    'unauthorized' OCSP response (status 6 per RFC 6960 §2.3) using
    the canonical ocsp.UnauthorizedErrorResponse from x/crypto/ocsp.
    Does NOT echo malicious bytes back.
  - On well-formed nonce: passes it through GetOCSPResponseWithNonce.
  - On no nonce: nil passed through; back-compat preserved.

GET OCSP handler unchanged — the GET form has no body to carry a
nonce extension.

6 new tests in internal/service/ocsp_nonce_test.go pin every
documented failure mode + the 32-byte boundary. The test fixture
builds an OCSPRequest via golang.org/x/crypto/ocsp.CreateRequest then
splices in a [2] EXPLICIT Extensions element by hand (the library
doesn't expose extension construction either).

Pre-commit verification: gofmt clean, go vet clean across affected
packages, go test -short -count=1 green for service/ + handler/ +
connector/issuer/local/. No new env vars introduced (Phase 1 is
always-on per RFC; no operator opt-out).
2026-04-30 04:55:06 +00:00
shankar0123 5834e5b866 fix(est): plumb context through ESTService.ReloadTrust to satisfy contextcheck
CI golangci-lint v2.11.4 flagged internal/api/handler/admin_est.go:178:
the AdminESTServiceImpl.ReloadTrust method took ctx context.Context but
called svc.ReloadTrust() with no context, then the underlying
ESTService.ReloadTrust used context.Background() internally for the
audit RecordEvent call. That's the contextcheck linter's textbook
'context discarded at boundary' violation.

Fix: change ESTService.ReloadTrust signature to ReloadTrust(ctx
context.Context) and forward the caller-supplied ctx into
auditService.RecordEvent. AdminESTServiceImpl.ReloadTrust now passes
its received ctx through. The HTTP handler already forwards
r.Context() one layer up, so the request-scoped trace identifiers now
flow end-to-end into the audit row instead of being severed at the
service boundary.

Verified locally with golangci-lint v2.11.4 (the same version CI runs)
against ./internal/api/handler/... ./internal/service/... — '0
issues.' All cmd/* binaries build clean, go test -short -count=1
green for both packages.
2026-04-30 01:59:04 +00:00
shankar0123 5a682db8e2 EST RFC 7030 hardening master bundle Phases 10-11: libest sidecar e2e
+ Cisco IOS quirk fixtures + ManagedCertificate.Source provenance +
EST bulk-revoke endpoint + 13 typed audit action codes.

Phase 10.1 — libest reference-client sidecar:
- deploy/test/libest/Dockerfile: multi-stage Debian-bookworm-slim
  build of Cisco's libest v3.2.0-2 from source (autoconf/automake/
  libtool + libcurl4-openssl-dev + libssl-dev). Runtime stage
  carries only estclient + bash + openssl + ca-certificates so the
  exec surface stays small + predictable.
- docker-compose.test.yml libest-client entry (profiles: [est-e2e])
  with bind mounts for /config/est (test workspace) + /config/certs
  (certctl CA bundle for TLS pinning); IP 10.30.50.9 (10.30.50.8
  was already taken by certctl-agent).
- deploy/test/est/.gitkeep keeps the bind-mount target tracked.

Phase 10.2 — 5 integration tests (//go:build integration) in
deploy/test/est_e2e_test.go:
- TestEST_LibESTClient_Enrollment_Integration (cacerts → simpleenroll
  → cert-shape assertion)
- TestEST_LibESTClient_MTLSEnrollment_Integration (mTLS sibling-route
  cert auth; skip when bootstrap cert absent)
- TestEST_LibESTClient_ServerKeygen_Integration (RFC 7030 §4.4
  multipart; skip when profile gate disabled)
- TestEST_LibESTClient_RateLimited_Integration (4th enroll trips
  per-principal cap, asserts 429-shaped error)
- TestEST_LibESTClient_ChannelBinding_Integration (libest
  --tls-exporter; skip when libest build lacks the flag).
- requireESTSidecar guard skips the suite when the operator forgot
  --profile est-e2e; helpful error message includes the exact
  command to bring the sidecar up.

Phase 10.3 — Cisco IOS quirk fixtures + 3 unit tests in
internal/api/handler/cisco_ios_quirks_test.go:
- testdata/cisco_ios_15x_pem_csr.txt: PEM body sent with
  Content-Type application/x-pem-file. Handler dispatches on
  body-prefix not Content-Type — accepts cleanly.
- testdata/cisco_ios_16x_trailing_newline_csr.txt: extra trailing
  newlines after base64 body. strings.TrimSpace tolerates.
- testdata/cisco_ios_crlf_b64_csr.txt: CRLF-wrapped base64.
  base64.StdEncoding handles CRLF + LF identically.

Phase 11.1 — ManagedCertificate.Source provenance:
- New domain.CertificateSource enum (Unspecified/EST/SCEP/API/Agent).
- Migration 000023_managed_certificates_source.up.sql adds source
  TEXT NOT NULL DEFAULT '' so existing rows scan as
  CertificateSourceUnspecified — back-compat: bulk-revoke filter
  treats empty as "any source".
- Postgres repo Insert/Update/scan paths all wire the new column.

Phase 11.2 — EST bulk-revoke endpoint:
- BulkRevocationCriteria.Source field (Source-only requests rejected
  as too broad — must accompany at least one narrower criterion).
- service.bulk_revocation.resolveCertificates post-filter by Source
  (empty=any, no SQL change so existing CertificateFilter callers
  unaffected).
- New BulkRevocationHandler.BulkRevokeEST method pins Source=EST +
  dispatches; new route POST /api/v1/est/certificates/bulk-revoke
  (M-008 admin-gated). openapi.yaml documented + parity-guard green.

Phase 11.3 — 13 typed audit action codes in
internal/service/est_audit_actions.go:
- est_simple_enroll_success / _failed
- est_simple_reenroll_success / _failed
- est_server_keygen_success / _failed
- est_auth_failed_basic / _mtls / _channel_binding
- est_rate_limited
- est_csr_policy_violation
- est_bulk_revoke
- est_trust_anchor_reloaded
- ESTService.processEnrollment + SimpleServerKeygen + ReloadTrust
  split-emit BOTH the legacy bare action codes (back-compat for the
  GUI activity-tab chip filters that match by exact string +
  existing audit-log analysers) AND the new typed _success / _failed
  variants (operator grep target + per-failure-mode counter).

Tests:
- internal/api/handler/bulk_revocation_est_test.go — 5 cases
  (admin-true happy path pins Source=EST + non-admin 403 +
  empty-criteria 400 + invalid-reason 400 + method-not-allowed).
- internal/service/est_audit_actions_test.go — 5 cases (SimpleEnroll
  legacy+typed emission / SimpleReEnroll typed / IssuerError
  typed-failed / PolicyViolation triple-emit /
  unique-string invariant).

Pre-commit verification (sandbox): gofmt clean, go vet clean
(excluding repository/postgres testcontainers limit), staticcheck
clean across api/handler/api/router/domain/service/deploy/test,
go test -short -count=1 green for every non-postgres Go package +
integration build (`go build -tags integration ./deploy/test/...`)
clean. G-3 docs-drift guard reproduced locally clean (Phases 10-11
added zero new env vars).

Spec preserved at cowork/est-rfc7030-hardening-prompt.md. Phases
12-13 (docs/est.md + WiFi/802.1X / IoT bootstrap / FreeRADIUS
recipes; release prep + tag) remain — post-2.1.0 work.
2026-04-30 00:52:43 +00:00
shankar0123 43075a1b5c EST RFC 7030 hardening master bundle Phases 5-7: end-to-end serverkeygen
+ profile-driven csrattrs + admin observability with per-status
counters + reload-trust endpoint.

Phase 5 — RFC 7030 §4.4 server-driven key generation:
- internal/pkcs7/envelopeddata_builder.go is the inverse of the
  existing parser/decryptor: AES-256-CBC content cipher + RSA PKCS#1
  v1.5 keyTrans + per-call random IV. Round-trip pinned in test
  (BuildEnvelopedData → ParseEnvelopedData → Decrypt returns the
  original plaintext byte-for-byte).
- ESTService.SimpleServerKeygen runs the full §4.4 flow: parse client
  CSR → require RSA pubkey for keyTrans → resolve per-profile
  algorithm (RSA-2048 default; honors AllowedKeyAlgorithms) → in-
  memory keygen → re-build CSR with server pubkey → run existing
  issuer pipeline → marshal PKCS#8 → CMS-EnvelopedData wrap to a
  synthetic recipient cert wrapping the device's CSR-supplied pubkey
  → zeroize plaintext + PKCS#8 bytes → return CertPEM + ChainPEM
  + EncryptedKey. Typed sentinels ErrServerKeygenRequiresKey-
  Encipherment / ErrServerKeygenUnsupportedAlgorithm /
  ErrServerKeygenDisabled.
- ESTHandler.ServerKeygen + ServerKeygenMTLS emit RFC 7030 §4.4.2
  multipart/mixed with random per-response boundary; per-profile
  SetServerKeygenEnabled gate returns 404 when off (defense in depth
  even if the route was registered).
- New routes POST /.well-known/est/[<PathID>/]serverkeygen +
  /.well-known/est-mtls/<PathID>/serverkeygen; openapi.yaml +
  openapi-parity guard updated.

Phase 6 — Real csrattrs implementation:
- New CertificateProfile.RequiredCSRAttributes []string + migration
  000022_certificate_profiles_csrattrs.up.sql. The migration also
  lands the previously-unwired must_staple column (closes the 5.6
  follow-up loop where the field shipped at the domain + service
  layer but the postgres scan/insert/update never persisted it).
- domain.EKUStringToOID + AttributeStringToOID lookup tables: id-kp-*
  EKUs (RFC 5280 §4.2.1.12) + RFC 5280 DN attributes + RFC 2985
  PKCS#10 attributes + Microsoft Intune device-serial OID.
- ESTService.GetCSRAttrs replaces the v2.0.x nil/204 stub with a
  profile-derived SEQUENCE OF OID ASN.1 marshal. Unknown EKU /
  attribute strings dropped + warning-logged so a typo doesn't take
  down the entire endpoint.

Phase 7 — Admin observability + counters + reload-trust:
- internal/service/est_counters.go: estCounterTab (sync/atomic; 12
  named labels) + ESTStatsSnapshot per-profile shape +
  ESTService.Stats(now) zero-allocation accessor + ReloadTrust()
  SIGHUP-equivalent + SetESTAdminMetadata setter.
- Counter ticks wired into processEnrollment + SimpleServerKeygen at
  every success/failure leg.
- internal/api/handler/admin_est.go mirrors AdminSCEPIntune verbatim:
  Profiles + ReloadTrust handlers + AdminESTServiceImpl. Both
  endpoints admin-gated (M-008 triplet pinned + admin_est.go added
  to AdminGatedHandlers).
- New routes GET /api/v1/admin/est/profiles + POST /api/v1/admin/
  est/reload-trust; openapi.yaml documented; openapi-parity guard
  reproduced clean.
- cmd/server/main.go grows estServices map populated by the per-
  profile EST loop + handed to AdminEST. New MTLSTrust() +
  HasMTLSTrust() accessors on ESTHandler so main.go can pull the
  trust holder for the admin-metadata wire-up.
- Per-profile counter isolation regression test
  (internal/service/est_profile_counter_isolation_test.go) proves
  a future shared-counter refactor would fail at compile-time
  pointer-identity check.

Pre-commit verification (sandbox): gofmt clean, go vet clean
(excluding repository/postgres which the sandbox can't build —
disk-space testcontainers download), staticcheck clean across
cms/trustanchor/api/handler/api/router/scep/intune/ratelimit/
service/pkcs7/domain/cmd/server, go test -short -count=1 green
for every non-postgres package. G-3 docs-drift guard reproduced
locally clean (Phases 5-7 added zero new env vars; Phase 1
already documented per-profile SERVER_KEYGEN_ENABLED).

Spec preserved at cowork/est-rfc7030-hardening-prompt.md. Phases
8-13 (GUI ESTAdminPage / CLI+MCP / libest e2e / bulk revocation /
docs/est.md / release prep) remain — post-2.1.0 work.
2026-04-29 23:57:45 +00:00
shankar0123 aa139ee0d9 EST RFC 7030 hardening master bundle Phases 2-4: end-to-end mTLS sibling
route + RFC 9266 channel binding + HTTP Basic enrollment-password +
per-source-IP failed-auth limit + per-(CN, sourceIP) sliding-window cap.

Two new shared packages so EST + Intune share infrastructure:
- internal/cms/ — RFC 9266 tls-exporter extractor (ExtractTLSExporter
  with stdlib-panic recovery for synthetic ConnectionStates) +
  CSR-side channel-binding parser via raw TBSCertificationRequestInfo
  walk (the stdlib's csr.Attributes can't represent the OCTET STRING
  binding value), VerifyChannelBinding composite, EmbedChannel-
  BindingAttribute fixture helper, typed sentinel errors for missing
  / mismatch / not-TLS-1.3 mapped to HTTP 400 / 409 / 426 in handler.
- internal/trustanchor/ — extracted from scep/intune/trust_anchor*.go
  so the EST mTLS sibling route + Intune dispatcher share the same
  SIGHUP-reloadable PEM bundle primitive. intune.TrustAnchorHolder
  is now `= trustanchor.Holder` (type alias) + NewTrustAnchorHolder =
  trustanchor.New (function alias) — every existing call site compiles
  unchanged. Intune's LoadTrustAnchor is a thin wrapper over
  trustanchor.LoadBundle. White-box tests moved to the new package.
- internal/ratelimit/ — extracted from scep/intune/rate_limit.go (this
  was Phase 4.1, in the same bundle). intune.PerDeviceRateLimiter
  is now a thin wrapper preserving the (subject, issuer)→key
  composition; EST handler reaches for SlidingWindowLimiter directly.

ESTHandler grew six optional fields wired by per-profile setters
(SetMTLSTrust / SetChannelBindingRequired / SetEnrollmentPassword /
SetSourceIPRateLimiter / SetPerPrincipalRateLimiter / SetLabelForLog)
plus four new mTLS-route methods (CACertsMTLS / SimpleEnrollMTLS /
SimpleReEnrollMTLS / CSRAttrsMTLS); shared internal pipeline
handleEnrollOrReEnroll(reEnroll, viaMTLS) keeps the auth/binding/
rate-limit gates DRY. New router method RegisterESTMTLSHandlers
registers /.well-known/est-mtls/<PathID>/{cacerts,simpleenroll,
simplereenroll,csrattrs}; AuthExemptDispatchPrefixes extends the
no-auth chain to /.well-known/est-mtls.

cmd/server/main.go's EST loop wires per-profile mTLS holder +
channel-binding policy + per-principal limiter + (when EnrollmentPassword
non-empty) Basic + source-IP limiter; new preflightESTMTLSClientCATrust-
Bundle returns *trustanchor.Holder so SIGHUP rotates the EST mTLS
bundle live without restart. SCEP + EST mTLS profiles now share a
single union mtlsUnionPoolForTLS passed to buildServerTLSConfigWithMTLS
(replaces the protocol-specific scepMTLSUnionPoolForTLS); per-handler
re-verify enforces "cert must chain to THIS profile's bundle" so
cross-protocol bleed is blocked at the application layer even though
the TLS layer trusts certs from either pool's union.

Phase 3.3 source-IP failed-Basic limiter defaults: 10 attempts / 1h
/ 50k tracked IPs (no env var; tunable in a follow-up). Phase 4.2
per-principal limiter cap from CERTCTL_EST_PROFILE_<NAME>_RATE_
LIMIT_PER_PRINCIPAL_24H (existing field, Phase 1 shipped).

New tests:
- internal/cms/channelbinding_test.go: extractor + CSR-side parser +
  composite + TLS-1.3 round-trip end-to-end + EmbedChannelBinding-
  Attribute round-trip
- internal/trustanchor/holder_test.go: parseBundlePEM white-box +
  LoadBundle + Holder Get/Pool/SetLabelForLog/Reload-happy/
  Reload-keeps-old-on-failure/Reload-keeps-old-on-expired/
  WatchSIGHUP-reloads-pool/WatchSIGHUP-stop-clean
- internal/api/handler/est_hardening_test.go: 16 named cases covering
  mTLS no-trust-pool 500 + no-cert 401 + cross-profile cert 401 +
  happy-path 200 + CACertsMTLS auth gate + CSRAttrsMTLS auth gate +
  channel-binding required-absent-rejected + not-required-absent-
  allowed + writeChannelBindingError mapping + Basic no-header 401
  + Basic wrong-password 401 + Basic correct-200 + Basic-no-password
  no-gate + per-IP failed-attempt lockout 429 + per-principal
  blocks-after-cap + different-principals-independent + no-limiter-
  unbounded.

Pre-commit verification (sandbox): gofmt clean, go vet clean
(excluding repository/postgres which the sandbox can't build —
disk-space testcontainers download), staticcheck clean for
cms/trustanchor/api/handler/api/router/scep/intune/ratelimit/
cmd/server, go test -short -count=1 green for cms/trustanchor/
api/handler/api/router/scep/intune/ratelimit/service. G-3
docs-drift guard reproduced locally clean (Phase 1 already
documented every new env var; Phases 2-4 added zero new env vars).
2026-04-29 23:15:35 +00:00
shankar0123 a808948397 feat(est): per-profile dispatch — multi-profile env-var family + back-compat shim
EST RFC 7030 hardening master bundle Phases 0 + 1 of 13. Lays the
foundation for the remaining hardening phases (mTLS auth, HTTP Basic
auth, channel binding, server-keygen, admin observability, GUI, libest
e2e) without changing existing operator behavior — backward-compat
shim preserves the v2.0.66 single-issuer flat env-var setup.

WHAT LANDS:

Phase 0 — Frozen decisions
  9 frozen decisions documented in
  cowork/est-rfc7030-hardening-prompt.md::Phase 0 frozen decisions
  (auth modes mTLS+Basic at GA; RFC 9266 channel binding; multi-profile
  env-var family CERTCTL_EST_PROFILES; mTLS sibling URL
  /.well-known/est-mtls/<pathID>; serverkeygen ships V2; fullcmc
  deferred; renewal device-driven per RFC 7030 §4.2.2; csrattrs
  algorithm allow-list profile-derived; libest as e2e reference).

Phase 1 — Multi-profile config + per-profile dispatch
  internal/config/config.go: extended ESTConfig with Profiles slice;
  added ESTProfileConfig struct with all field contracts (PathID +
  IssuerID + ProfileID + EnrollmentPassword + MTLSEnabled +
  MTLSClientCATrustBundlePath + ChannelBindingRequired +
  AllowedAuthModes + RateLimitPerPrincipal24h + ServerKeygenEnabled).
  Forward-looking fields (mTLS, HTTP Basic, channel binding,
  rate limit, server-keygen) are dormant in Phase 1 — Phase 2-5 wire
  the corresponding handlers; Validate() gates ensure operators can't
  set incoherent combinations (MTLSEnabled=true without bundle path,
  basic auth without password, mtls auth mode without MTLSEnabled,
  ChannelBindingRequired without mTLS, ServerKeygenEnabled without
  ProfileID).

  loadESTProfilesFromEnv: mirrors loadSCEPProfilesFromEnv exactly.
  Reads CERTCTL_EST_PROFILES=corp,iot,wifi and per-profile env vars
  CERTCTL_EST_PROFILE_<NAME>_*. Lowercase PathID, uppercase env-var
  name. parseAuthModes handles comma-separated normalization.

  mergeESTLegacyIntoProfiles: back-compat shim. When CERTCTL_EST_PROFILES
  is unset AND CERTCTL_EST_ENABLED=true, synthesizes a single-element
  Profiles[0] with PathID="" so existing /.well-known/est/
  operators see no behavior change.

  validESTPathID + validESTAuthMode: shape validators. PathID matches
  [a-z0-9-]+ with no leading/trailing hyphen (mirrors validSCEPPathID
  exactly). Auth mode is one of {mtls, basic}.

  Per-profile Validate(): refuses every documented misconfiguration
  with operator-greppable error messages naming the offending profile
  index + PathID + field. Mirrors the SCEP audit-closure pattern.

internal/api/router/router.go: refactored RegisterESTHandlers from
  single-handler to map[string]ESTHandler. Empty PathID maps to legacy
  /.well-known/est/ root (literal-string r.Register calls preserve
  openapi-parity scanner behavior). Non-empty PathIDs dynamic-register
  /.well-known/est/<pathID>/{cacerts,simpleenroll,simplereenroll,csrattrs}.
  Mirrors the SCEP per-profile dispatch from commit fdd424b.

cmd/server/main.go: refactored EST startup block to iterate
  cfg.EST.Profiles. Per-profile preflight (issuer-in-registry,
  preflightEnrollmentIssuer L-005 gate) runs in the loop with
  per-profile structured logging including PathID. Failures log the
  offending PathID so multi-profile deploys can pinpoint which broke
  startup. Mirrors the SCEP per-profile loop from commit fdd424b.

Updated 3 callers of the old single-handler signature:
  - internal/api/router/router_test.go::TestRegisterESTHandlers_AllPaths
  - internal/integration/lifecycle_test.go::setupTestServer
  - internal/integration/negative_test.go::setupTestServer
  Each wraps the existing single ESTHandler in a single-element
  map[string]handler.ESTHandler{"": estHandler} preserving exact
  legacy behavior.

NEW TESTS:

internal/config/config_est_profiles_test.go (12 tests):
  - LegacyFlatFields_SynthesizeSingleProfile (back-compat shim)
  - DisabledNoLegacyShim
  - MultipleProfiles_LoadFromEnv (3 profiles: corp+mtls+basic+keygen,
    iot+basic, wifi+mtls; verifies every field round-trips)
  - StructuredFormBeatsLegacy
  - PathIDValidation (12 sub-cases: empty/valid/leading-hyphen/
    trailing-hyphen/uppercase/slash/dot/underscore/space/percent)
  - DuplicatePathID_Refuses
  - MissingPerProfileIssuerID
  - MTLSEnabledRequiresBundlePath
  - ChannelBindingWithoutMTLS_Refuses (cross-check)
  - BasicAuthInModesRequiresPassword (cross-check)
  - MTLSAuthModeRequiresMTLSEnabled (cross-check)
  - UnknownAuthModeRefused
  - NegativeRateLimitRefused
  - ServerKeygenRequiresProfileID
  - DisabledIgnoresProfiles
  - ParseAuthModes_Normalization (8 sub-cases)

internal/api/router/router_est_profiles_test.go (4 tests):
  - LegacyEmptyPathIDMapsToRoot
  - NonEmptyPathIDMapsToSubpath
  - MultipleProfilesNoCrossBleed (the load-bearing dispatch invariant —
    each profile's PathID routes to its OWN handler instance,
    proven via per-profile-tagged mock responses with base64 prefix
    matching)
  - EmptyMapRegistersNoRoutes

VERIFICATION (sandbox, Go 1.25.9):
  gofmt -l                — clean for all changed files
  staticcheck             — clean for config + router + handler +
                            integration + cmd/server packages
  go vet                  — clean for the same packages
  go test -short -count=1 — green for config, router, handler,
                            service, integration, cmd/server

NEXT (Phase 2): mTLS client cert auth + TrustAnchorHolder + RFC 9266
tls-exporter channel binding. Phase 1's Validate gates already refuse
the incoherent configurations Phase 2 must defend against; Phase 2
adds the actual TLS-listener wiring + handler-side cert validation +
channel-binding extraction.

Spec preserved at cowork/est-rfc7030-hardening-prompt.md.
2026-04-29 22:17:52 +00:00
shankar0123 530593507b fix(scep-intune): close 11 audit gaps from 2026-04-29 pre-tag review
Closes the eleven gaps identified in the pre-v2.1.0 audit of the SCEP
RFC 8894 + Intune master bundle (cowork/scep-bundle-gap-closure-prompt.md).
Constitutional rule from cowork/CLAUDE.md::Operating Rules — 'Always
take the complete path, not the easy path' — drove this closure: each
gap was a load-bearing wire that crossed multiple layers (config →
validator → service wire-up → tests → docs) and shipping the bundle
without them would have produced lying-field footguns where operator-
visible config options stored values without affecting behavior.

WHAT LANDS:

Phase A — Clock-skew tolerance (master prompt §15 hazard closure)
  internal/scep/intune/challenge.go: ValidateChallenge migrated from
  positional args to ValidateOptions{} struct; new ClockSkewTolerance
  field with default 0 (strict). 24 call sites updated mechanically.
  Asymmetric application: now+tolerance >= iat AND now-tolerance < exp.
  internal/config/config.go: SCEPIntuneProfileConfig.ClockSkewTolerance
  default 60s + Validate() refusal when >= ChallengeValidity.
  cmd/server/main.go: SetIntuneIntegration signature extended;
  per-profile env-var loader honors CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CLOCK_SKEW_TOLERANCE.
  internal/service/scep.go: intuneClockSkew field + IntuneStatsSnapshot
  surfaces clock_skew_tolerance_ns. web/src/api/types.ts mirrors.
  4 new tests in challenge_test.go covering accept-within-tolerance,
  reject-beyond-tolerance, accept-expired-within-tolerance,
  negative-treated-as-zero defensive normalization.
  docs/scep-intune.md updated with the new env var + time-bounds rule.

Phase B — unknown-version-rejected golden test
  internal/scep/intune/golden_helper_test.go: goldenUnknownVersionPayload
  helper + signGoldenChallengeAny generic signer.
  challenge_golden_test.go: TestGoldenChallenge_UnknownVersionRejected
  uses an in-process ECDSA fixture (the on-disk PEM was generated with
  a Go-stdlib version that produces different ecdsa.GenerateKey bytes
  from the current call). TestRegenerateGoldenFixtures emits the new
  unknown_version fixture file too.

Phase C — Two named Intune e2e tests
  internal/api/handler/scep_intune_e2e_test.go:
    TestSCEPIntuneEnrollment_RateLimited_E2E (cap=2 + 3 attempts; 3rd
    returns FAILURE+badRequest with rate_limited counter ticked)
    TestSCEPIntuneEnrollment_TrustAnchorSIGHUPReload_E2E (rotate
    on-disk PEM + holder.Reload(); old-key challenge fails with
    badMessageCheck; signature_invalid counter ticked)
  intuneE2EFixture struct extended with trustHolder + trustPath fields
  so tests can rotate.

Phase D — Four new ChromeOS hermetic tests (10 total now)
  internal/api/handler/scep_chromeos_test.go:
    _RAKeyMismatch — PKIMessage encrypted to wrong RA cert; handler
      rejects without reaching service.
    _3DESBackwardCompat — RFC 8894 §3.5.2 legacy fallback verified.
    _RSACSR + _ECDSACSR — explicit matrix-pair pinning.
  buildTestECDSACSR helper for ECDSA P-256 CSR construction;
  tripleDESCBCEncrypt mirrors aesCBCEncrypt for 3DES-CBC;
  assertChromeOSPositiveCertRep shared assertion.

Phase E — Per-profile counter isolation test
  internal/api/handler/scep_profile_counter_isolation_test.go:
    TestSCEPHandler_PerProfileIntuneCountersIsolated wires two
    SCEPService instances + drives distinct PKIMessages + asserts
    counter isolation. Guards against a future cmd/server/main.go
    refactor that shares a *intuneCounterTab across profiles.
  buildPerProfileIntuneFixture parameterized helper.

Phase F — Server-boot regression tests
  cmd/server/preflight_scep_intune_test.go: 3 named tests covering
  disabled-backward-compat, broken-config-with-PathID, expired-cert
  refusal. preflightSCEPIntuneTrustAnchor signature extended with
  pathID arg so error messages carry PathID= for operator log-grep.

Phase G — docs/connectors.md
  Four new subsections under §EST/SCEP Integration: multi-profile
  dispatch + mTLS sibling route + Intune Connector dispatcher + SCEP
  probe in network scanner. Each has a one-paragraph operator
  explanation + an env-var or endpoint table.

Phase H — Coverage uplift
  internal/service/scep_probe_persist_test.go: 5 unit tests on
  persistProbeResult (nil-safe + nil-repo-safe + repo-error swallow +
  nil-logger guard) + ListRecentSCEPProbes (empty-slice-not-nil + repo
  pass-through) + describeCertAlgorithm (RSA/ECDSA/QF1008-nil-curve
  defensive branch/Ed25519/DSA/empty). CI gates (service ≥70, handler
  ≥75) PASS at 70.9% / 79.3%.

Phase I — deploy/test integration variant
  deploy/test/scep_intune_e2e_test.go (//go:build integration):
    TestSCEPIntuneEnrollment_Integration + _RateLimited_Integration
    against the live docker-compose certctl container. Skip-when-
    stack-missing semantics so sandbox + CI both work.
  deploy/docker-compose.test.yml: new e2eintune SCEP profile env
  vars + bind-mount of deploy/test/fixtures/.
  deploy/test/fixtures/README.md: documents the deterministic trust
  anchor regeneration recipe.

VERIFICATION (sandbox):
  gofmt -d        — clean for all changed files
  staticcheck     — clean for intune + handler + config + service +
                    cmd/server packages
  go vet          — clean for the same packages
  go test -short  — green for intune (95.3% cov), service (70.9%),
                    handler (79.3%), config (94.0%), cmd/server (boot
                    path; my preflight tests cover the directly-
                    testable function), pkcs7 (80.5% informational)

DEFERRED (per closure prompt §7 out-of-scope):
  - V3-Pro Conditional Access gating + Microsoft Graph integration
  - Standalone certctl-scan CLI binary
  - OCSP rate-limiting, OCSP stapling, delta CRLs

Spec preserved at cowork/scep-bundle-gap-closure-prompt.md;
journal at cowork/scep-rfc8894-intune/progress.md (audit-closure
section appended).
2026-04-29 20:28:53 +00:00
shankar0123 506cff137d feat(scep): SCEP probe in network scanner for fleet-readiness assessment
Phase 11.5 of the SCEP RFC 8894 + Intune master bundle. Adds an
operator-facing SCEP probe that issues GetCACaps + GetCACert against
an arbitrary SCEP server URL and returns a structured posture snapshot
(reachable + advertised caps + RFC 8894 / AES / POST / Renewal /
SHA-256 / SHA-512 support flags + CA cert subject + issuer + NotBefore
+ NotAfter + days-to-expiry + algorithm + chain length).

Two operator use cases per the master prompt:

  1. Pre-migration assessment — probe an existing EJBCA / NDES SCEP
     server before switching to certctl to see what capabilities it
     advertises and what the CA cert looks like.
  2. Compliance posture audits — periodic ad-hoc probes against the
     operator's own SCEP servers to flag drift.

Capability-only — does NOT POST a CSR per the spec (would consume slot
allocations on the target server + create audit noise). Standalone CLI
binary explicitly out of scope (per the master prompt §11.5.6 and the
operator's confirmation): the probe code lands inside certctl; a
future thin Cobra wrapper is a separate decision.

Backend (six new + one extended file):

  * internal/domain/network_scan.go — new SCEPProbeResult struct with
    every probe field documented for the GUI's display layer.

  * migrations/000021_scep_probe_results.up.sql + .down.sql — new
    scep_probe_results table with TEXT id, target_url, all probe
    flags, CA cert metadata, probed_at, probe_duration_ms, error.
    Two indexes: idx_scep_probe_results_probed_at (DESC) for the
    'recent probes' GUI query, idx_scep_probe_results_target_url
    (target_url, probed_at DESC) for the future per-URL history view.

  * internal/repository/interfaces.go — new SCEPProbeResultRepository
    interface (Insert + ListRecent).

  * internal/repository/postgres/scep_probe_results.go — Postgres
    implementation. ListRecent clamps limit to [1, 200]; on read
    re-derives ca_cert_days_to_expiry against the query-time wall
    clock so 'X days remaining' stays fresh.

  * internal/service/scep_probe.go — ProbeSCEP(ctx, url) on
    NetworkScanService. Validation order:
      1. Up-front URL validation via validation.ValidateSafeURL
         (defaults to validation.ValidateSafeURL but injectable for
         tests via the new scepValidateURL field on the service).
      2. Dial-time SSRF re-check via SafeHTTPDialContext on the
         http.Transport (defends against DNS rebinding).
      3. GET ?operation=GetCACaps + GET ?operation=GetCACert.
         GetCACert handles three response shapes: PKCS#7 SignedData
         certs-only envelope (multi-cert), raw DER (single-cert),
         and PEM-wrapped DER (non-conforming servers).
    Times out at 30s; uses a 1MB body cap for DoS defense; wraps
    the result + persists via the repo (nil-safe) before returning.
    describeCertAlgorithm helper returns 'RSA-N' / 'ECDSA-curve' /
    'Ed25519' / 'DSA' for the GUI's algorithm column.

  * internal/service/network_scan.go — added scepProbeRepo +
    scepHTTPClient + scepValidateURL + scepIDFn + nowFn fields;
    SetSCEPProbeRepo wires the repo at startup.

  * internal/api/handler/network_scan.go — extended NetworkScanService
    interface with ProbeSCEP + ListRecentSCEPProbes; added two new
    HTTP handlers:
      POST /api/v1/network-scan/scep-probe   (body {url})
      GET  /api/v1/network-scan/scep-probes  (recent history)
    Synchronous probe; HTTP 200 with the result body for both success
    and reachable-but-failed cases (so the GUI can render the failure
    tone with the operator-actionable error message).

  * internal/api/router/router.go — registered the two routes inline
    after the existing network-scan target endpoints.

  * api/openapi.yaml — documented both endpoints (operationId
    probeSCEP + listSCEPProbes) with full schema + response codes.

  * cmd/server/main.go — wires the new SCEPProbeResultRepository
    onto the network scan service via SetSCEPProbeRepo right after
    the existing NewNetworkScanService construction.

Backend tests (6 new — exit-criteria-named per the master prompt):

  * TestProbeSCEP_AdvertisesAllCaps — happy path, full RFC 8894
    capability set, ECDSA P-256 CA cert, 365-day expiry.
  * TestProbeSCEP_MissingSCEPStandard — pre-RFC-8894 server (only
    POSTPKIOperation + SHA-1 + DES3); SupportsRFC8894 = false.
  * TestProbeSCEP_GetCACertExpired — CA cert NotAfter 30d in the
    past; CACertExpired = true.
  * TestProbeSCEP_Unreachable — connect to TCP port 1; probe
    returns Reachable=false + non-empty Error.
  * TestProbeSCEP_RejectsReservedIP — http://169.254.169.254/scep
    (EC2 metadata literal) rejected by the up-front
    validation.ValidateSafeURL gate; result captures the error
    without ever issuing the HTTP call.
  * TestProbeSCEP_PEMWrappedCert — server returns PEM instead of
    raw DER for GetCACert; the fallback parse path handles it.

Frontend (one extended file + types/client):

  * web/src/api/types.ts — SCEPProbeResult + SCEPProbesResponse.
  * web/src/api/client.ts — probeSCEPServer + listSCEPProbes
    helpers.
  * web/src/pages/NetworkScanPage.tsx — new SCEPProbeSection
    component + ProbeResultPanel (with capability badges + CA cert
    details panel + raw caps line) + SCEPProbeHistoryTable. Form
    rejects empty URL with inline error before calling the API.
    Reload mutation goes through useTrackedMutation with explicit
    invalidates: [['scep-probes']] (M-009 contract).

Frontend tests (5 new + 0 regressions):

  * Scep probe section header + form renders.
  * Empty URL is rejected with inline error and never calls the
    probe endpoint.
  * Successful probe renders capability badges + CA cert subject
    + days-remaining inline panel.
  * Probe-level errors are surfaced in the inline panel (no result
    panel rendered).
  * Recent-probes history table renders one row per probe.
  * (Existing 2 NetworkScanPage XSS-hardening tests stub the new
    listSCEPProbes endpoint to an empty list so they still pass.)

Verification:
  * gofmt clean on touched files
  * go vet ./... clean
  * staticcheck on service+handler+router+repository+cmd-server clean
  * go test -short across service+handler+router+repository+cmd-server
    + integration: all green (existing + 6 new probe tests pass)
  * Frontend tsc --noEmit clean
  * Vitest: 7/7 NetworkScanPage tests pass (2 existing XSS + 5 new
    probe section)
  * G-3 docs-drift CI guard reproduced locally clean (no new env vars)
  * M-009 hard-zero useMutation guard clean (probe mutation goes
    through useTrackedMutation)
  * openapi-parity guard satisfied (both new routes documented)
  * The mockNetworkScanService in handler + integration packages
    extended with stub Probe methods; targeted coverage stays in
    scep_probe_test.go.

Out of scope (per master prompt §11.5.6 + operator confirmation):
  * Standalone certctl-scan CLI binary — separate decision, ~1d of
    follow-up work when/if shipped.

Refs: cowork/scep-rfc8894-intune-master-prompt.md::Phase 11.5
      cowork/scep-rfc8894-intune/progress.md
2026-04-29 18:51:57 +00:00
shankar0123 0be889ff1d refactor(scep-gui): rebrand SCEP admin surface to per-profile tabbed interface (Profiles + Intune + Recent Activity)
Phase 9 follow-up to the SCEP RFC 8894 + Intune master bundle. The
Phase 9.4 GUI shipped 'SCEP Intune Monitoring' at /scep/intune, which
made the per-profile observability surface look Intune-only — operators
running EJBCA + Jamf would never click that nav link expecting per-
profile RA cert + mTLS observability. The page is per-profile keyed
under the hood; this commit rebrands + restructures so the surface
matches what operators actually need.

Spec: cowork/scep-gui-restructure-prompt.md.

User-visible change:

  - Nav link renamed: 'SCEP Intune' → 'SCEP Admin'.
  - Route: /scep is the new canonical path; /scep/intune kept as a
    backward-compat alias that lands directly on the Intune tab.
  - Page header: 'SCEP Administration'.
  - Three tabs:
      * Profiles (default) — per-profile lean cards with RA cert
        expiry countdown, mTLS sibling-route status badge, Intune
        enabled/disabled badge, challenge-password-set indicator.
        'View Intune details →' link on Intune-enabled cards
        deep-links into the Intune tab.
      * Intune Monitoring — the existing Phase 9.4 deep-dive
        (per-status counters, trust anchor expiry, recent failures
        table, reload-trust button + confirmation modal).
      * Recent Activity — full SCEP audit log filter merging all
        four action codes (scep_pkcsreq + scep_renewalreq +
        scep_pkcsreq_intune + scep_renewalreq_intune); chip filters
        for All / Initial / Renewal / Intune / Static.

Backend:

  * internal/service/scep.go — new SCEPProfileStatsSnapshot type +
    IntuneSection sub-block + ProfileStats(now) accessor. Adds
    raCertSubject/raCertNotBefore/raCertNotAfter + mtlsEnabled +
    mtlsTrustBundlePath fields with SetRACert + SetMTLSConfig setters.
    Existing IntuneStatsSnapshot + IntuneStats(now) preserved
    UNCHANGED for /admin/scep/intune/stats backward compat (the
    JSON shape stays byte-stable for external consumers — the
    aliasing approach the prompt initially suggested doesn't work
    because the new shape nests Intune while the old one is flat).
    ChallengePasswordSet is derived from challengePassword != ''
    (the secret value itself is never surfaced).

  * internal/api/handler/admin_scep_intune.go — new Profiles handler
    method on AdminSCEPIntuneHandler with the same M-008 admin gate.
    AdminSCEPIntuneServiceImpl extended (in place; same
    map[string]*service.SCEPService) to satisfy the new
    AdminSCEPProfileService interface. Single handler file gets the
    third method so the M-008 pin entry count stays steady (no new
    file, no new triplet of admin-gate test files — just three new
    Profiles tests inside the existing test file).

  * internal/api/router/router.go — one new route
    'GET /api/v1/admin/scep/profiles' registered to
    reg.AdminSCEPIntune.Profiles. HandlerRegistry unchanged.

  * api/openapi.yaml — new operation 'listSCEPProfiles' documenting
    the request body / response shape / error mapping. Existing
    Intune entries unchanged.

  * cmd/server/main.go — per-profile loop now calls
    scepService.SetMTLSConfig(profile.MTLSEnabled,
    profile.MTLSClientCATrustBundlePath) right after SetPathID, and
    scepService.SetRACert(raCert) right after loadSCEPRAPair returns
    the leaf cert. Both setters are nil-safe.

  * internal/api/handler/m008_admin_gate_test.go — extended the
    existing admin_scep_intune.go entry's justification to mention
    the third endpoint. No new map entry needed (file already
    listed).

Backend tests (8 new):

  * TestAdminSCEPProfiles_NonAdmin_Returns403
  * TestAdminSCEPProfiles_AdminExplicitFalse_Returns403
  * TestAdminSCEPProfiles_AdminPermitted_ForwardsActor — also pins
    that Intune-enabled profiles emit an 'intune' sub-block while
    Intune-disabled profiles OMIT it.
  * TestAdminSCEPProfiles_RejectsNonGetMethod
  * TestAdminSCEPProfiles_PropagatesServiceError
  * TestAdminSCEPProfilesServiceImpl_NilMapReturnsEmpty
  * (existing 16 Phase 9 admin tests still pass — backward-compat
    preserved)

Frontend:

  * web/src/api/types.ts — new SCEPProfileStatsSnapshot +
    IntuneSection + SCEPProfilesResponse types. Existing
    IntuneStatsSnapshot et al unchanged.
  * web/src/api/client.ts — new getAdminSCEPProfiles helper.
  * web/src/pages/SCEPAdminPage.tsx — full rewrite as the tabbed
    surface. Reuses the existing ConfirmReloadModal and Intune
    deep-dive card components verbatim; adds ProfileSummaryCard
    (lean card for the Profiles tab) and ActivityTab. URL state
    sync via useSearchParams so deep links survive reloads + browser
    back/forward. The legacy /scep/intune route alias defaults the
    activeTab to 'intune' on mount.
  * web/src/main.tsx — new <Route path='scep' /> + preserved
    <Route path='scep/intune' /> alias. Both render SCEPAdminPage.
  * web/src/components/Layout.tsx — nav link rebranded:
    label 'SCEP Intune' → 'SCEP Admin', to '/scep/intune' → '/scep'.

Frontend tests (20 — full rebuild):

  * Admin gate (non-admin sees gated banner + zero admin API calls)
  * Profiles tab default + Intune tab tabswitch + ?tab=intune deep
    link + legacy /scep/intune alias all land on Intune
  * Profiles tab status badges (Intune + mTLS + challenge-set)
    reflect each profile's flags
  * RA cert expiry tone bands (good ≥30d / warn 7-30d / bad <7d /
    EXPIRED) verified across three fixture profiles
  * 'View Intune details →' only renders for Intune-enabled
    profiles AND switches tabs on click
  * Empty-state banner when no profiles configured
  * Intune tab counters render with the existing Phase 9 deep-dive
    shape; reload modal Open/Confirm/Cancel/Error paths all pinned
  * Recent Activity tab merges all four SCEP audit actions across
    four parallel useQuery calls; filter chips
    (all/initial/renewal/intune/static) narrow correctly
  * Error path surfaces ErrorState on the active tab

Docs:

  * docs/scep-intune.md — Operational monitoring section heading
    expanded to '(SCEP Administration → Intune Monitoring tab)'.
    Page-surface description rewritten for the tabbed shape;
    admin-endpoints list extended with the new /admin/scep/profiles
    entry.
  * docs/architecture.md — Microsoft Intune Connector trust anchor
    subsection updated to reference the Intune Monitoring tab inside
    the SCEP Administration page + lists all three admin endpoints.
  * docs/legacy-est-scep.md — forward-ref expanded with a parallel
    sentence for the per-profile observability surface (independent
    of Intune).
  * README.md — Enrollment Protocols bullet for Intune updated to
    'admin GUI SCEP Administration page at /scep' with the three
    tabs called out.

Verification:
  * gofmt clean on touched files
  * go vet ./... clean
  * staticcheck on intune+service+handler+router+cmd-server clean
  * go test -short across intune+service+handler+router+cmd-server:
    all green (existing Phase 9 tests + new Profiles tests)
  * Frontend tsc --noEmit clean
  * Vitest: 20/20 SCEPAdminPage tests + 3/3 sibling AuditPage tests
    pass
  * G-3 docs-drift CI guard reproduced locally: clean (no new env
    vars; existing CERTCTL_SCEP_ allowlist prefix covers everything)
  * M-009 hard-zero useMutation guard reproduced locally: clean
    (the existing reload mutation already used useTrackedMutation
    from the Phase 9 follow-up commit 28e277a)
  * openapi-parity test green (new GET /api/v1/admin/scep/profiles
    operation documented)
  * M-008 admin-gate scanner green (existing admin_scep_intune.go
    entry covers all three handler methods; the test scanner
    enforces the triplet by file, not by endpoint, and the new
    Profiles triplet was added to the existing test file)

Backward compat preserved:
  * /api/v1/admin/scep/intune/stats unchanged — same JSON shape,
    same error codes, same M-008 gate
  * /api/v1/admin/scep/intune/reload-trust unchanged
  * /scep/intune route still works (alias to /scep with activeTab=intune)
  * IntuneStatsSnapshot Go type unchanged
  * IntuneStats(now) accessor unchanged

Refs: cowork/scep-gui-restructure-prompt.md
      cowork/scep-rfc8894-intune-master-prompt.md::Phase 9
      Phase 11.5 (SCEP probe in scanner — opt-in) and Phase 12
      (release prep + tag) of the master bundle resume after this.
2026-04-29 17:46:42 +00:00
shankar0123 e0d00717c7 feat(scep-intune): golden-file tests + e2e harness against fixture trust anchor
Phase 10 of the SCEP RFC 8894 + Intune master bundle. Adds reproducible
testdata fixtures + a hermetic end-to-end test that exercises the full
handler → service → dispatcher → CertRep wire path.

Phase 10.1 — Golden-file tests (internal/scep/intune/):

  * testdata/intune_trust_anchor.pem — deterministic ECDSA P-256 cert
    seeded from a constant byte string (sha256-derived PRNG); regenerates
    byte-identical PEM bytes across runs.
  * testdata/intune_challenge_golden_success.txt — valid challenge,
    iat/exp window covers goldenChallengeNow.
  * testdata/intune_challenge_golden_expired.txt — same trust anchor +
    payload shape but iat/exp shifted into the past.
  * testdata/intune_challenge_golden_tampered_sig.txt — payload bytes
    intact, last sig byte flipped.

  challenge_golden_test.go reads each fixture and asserts:
    - Success → ValidateChallenge returns a populated claim
      (DeviceName / Subject / SANDNS pinned to the documented values).
    - Expired → errors.Is(err, ErrChallengeExpired).
    - Tampered → errors.Is(err, ErrChallengeSignature).
    - Plus two defensive permutations: WrongAudienceReuse pins the
      audience-check ordering after a successful sig verify;
      RotatedTrustAnchorRejects pins the holder-rotation failure mode
      using a freshly-generated unrelated trust cert.

  golden_helper_test.go contains the deterministic-PRNG, ES256 signer,
  fixture-load helpers, and the regeneration target. Operators flip
  fixtures via:
    go test -run='^TestRegenerateGoldenFixtures$'             ./internal/scep/intune/... -args -update-golden

  Why ECDSA + a deterministic seed: a hand-pasted base64 blob would
  break on every Go stdlib bump (json.Marshal field ordering, ASN.1
  encoding edge cases). Generating from a pinned seed gives
  reproducible PEM bytes; only the ECDSA signature suffix varies
  across regenerations (Go's stdlib doesn't expose RFC 6979
  deterministic-k cleanly), and ValidateChallenge re-verifies the
  signature on every read so it doesn't matter.

  intune package coverage: 95.2% (was 94.8%).

Phase 10.2 — Hermetic end-to-end test (internal/api/handler/scep_intune_e2e_test.go):

  Departs from the spec's deploy/test/ location because the handler
  package already has the chromeOS-shape PKIMessage builders (buildTestCSR
  / buildEnvelopedDataForTest / buildSignedDataForTest / aesCBCEncrypt /
  postPKIOperation). Putting the e2e test in the handler package lets it
  reuse those helpers AND run in the default 'go test ./...' sweep —
  every CI run exercises the full Intune dispatcher chain. The
  deploy/test/ location is reserved for a future docker-compose-driven
  variant that would mount a fixture trust anchor into the running
  container; this hermetic version proves the wire works without that
  dependency.

  intuneE2EFixture stands up:
    - A real Intune Connector signing keypair (ECDSA P-256) + cert
      written to a temp PEM file the TrustAnchorHolder loads at startup.
    - A real RA pair the SCEPHandler decrypts EnvelopedData with.
    - A fixture issuer connector (intuneE2EIssuerConnector) that
      records every IssueCertificate call + returns a deterministic
      child cert chained to a fixture CA. Implements the full
      IssuerConnector interface (IssueCertificate / RenewCertificate /
      RevokeCertificate / GenerateCRL / SignOCSPResponse / GetRenewalInfo)
      with the non-issuance methods stubbed.
    - A capturing AuditRepository that records every Create call so
      the test can assert action='scep_pkcsreq_intune' was emitted.
    - A real SCEPService with SetIntuneIntegration wired to a real
      ReplayCache + PerDeviceRateLimiter.

  Three test scenarios:

    1. TestSCEPIntuneEnrollment_E2E — the documented happy path. Forge
       a valid Intune-shaped challenge (ES256 signed, length > 200, two
       dots — satisfies looksIntuneShaped), build a CSR with CN matching
       the claim's device_name, POST through HandleSCEP, decode the
       CertRep, assert pkiStatus=SUCCESS + issuer.issued has one entry
       + audit log carries 'scep_pkcsreq_intune' + IntuneStats.counters[
       'success']==1.

    2. TestSCEPIntuneEnrollment_ClaimMismatchRejected_E2E — same setup
       but CSR CN is 'attacker-host.example.com'. Dispatcher must
       reject with CertRep FAILURE+BadRequest (mapIntuneErrorToFailInfo:
       ErrClaimCNMismatch → BadRequest), no issuance, IntuneStats
       counters['claim_mismatch']==1.

    3. TestSCEPIntuneEnrollment_TamperedSignature_E2E — flip a byte in
       the JWT signature segment of the Intune challenge before
       wrapping it in the PKIMessage. Dispatcher rejects with
       FAILURE+BadMessageCheck (signature errors → BadMessageCheck per
       the same mapping table).

  Important sanity learning during construction: the buildTestCSR
  helper from scep_chromeos_test.go does NOT populate DNSNames on the
  CSR. The success claim therefore omits san_dns to avoid tripping
  ErrClaimSANDNSMismatch (claim says ['x'], CSR has nothing). The
  claim_mismatch sibling test exercises the SAN-dimension via the
  CN mismatch path; coverage of explicit SANDNS mismatches stays in
  the unit tests in claim_test.go where the helper builds CSRs with
  full SANs.

Verification:
  * gofmt clean on touched files
  * go vet ./internal/scep/intune/... ./internal/api/handler/...: clean
  * staticcheck: clean
  * go test -count=1 -cover ./internal/scep/intune/...: 95.2%
  * 5 golden tests + 3 e2e tests all pass
  * No new env vars (G-3 docs guard not triggered)
  * No new HTTP routes (openapi-parity guard not triggered)
  * Sibling test packages (service + router) still green

Refs: cowork/scep-rfc8894-intune-master-prompt.md::Phase 10
      cowork/scep-rfc8894-intune/progress.md
2026-04-29 16:55:52 +00:00
shankar0123 77e0281a0e feat(scep-intune): GUI monitoring tab + admin endpoints
Phase 9 of the SCEP RFC 8894 + Intune master bundle. Lands the operator-
facing Intune Monitoring tab plus the two admin-gated endpoints it reads
from. Per the constitutional 'complete path' rule: counters tick on
every typed dispatcher branch, the GUI poll is live (30s for stats,
60s for the audit log filter), and the SIGHUP-equivalent reload action
is one click + a confirmation modal — no follow-up plumbing required.

Backend (Phase 9.1 + 9.2 + 9.3):

  * internal/service/scep.go gains:
    - intuneCounterTab — atomic per-status counters keyed by the same
      labels intuneFailReason() emits (success / signature_invalid /
      expired / not_yet_valid / wrong_audience / replay / rate_limited /
      claim_mismatch / compliance_failed / malformed / unknown_version).
      Lock-free on the dispatcher hot path; snapshot() returns a
      zero-allocation map for the admin endpoint.
    - dispatchIntuneChallenge wires intuneCounters.inc(...) on every
      typed return path INCLUDING the success leg (credited before
      processEnrollment so a downstream issuer-connector failure
      doesn't double-count).
    - SetPathID + PathID accessors (so admin rows surface the SCEP
      profile path ID per row).
    - IntuneStatsSnapshot + IntuneTrustAnchorInfo public types, plus
      IntuneStats(now) accessor that walks the trust holder pool and
      packages a per-profile snapshot. ReloadIntuneTrust() is the
      typed wrapper around TrustAnchorHolder.Reload that returns
      ErrSCEPProfileIntuneDisabled when called on a profile where
      Intune isn't enabled (admin endpoint maps that to HTTP 409).

  * internal/api/handler/admin_scep_intune.go:
    - AdminSCEPIntuneService narrow interface (Stats + ReloadTrust)
      so the handler depends on a small surface; AdminSCEPIntuneServiceImpl
      is the production walker over the per-profile SCEPService map.
    - AdminSCEPIntuneHandler.Stats handles GET /api/v1/admin/scep/intune/stats
      with the M-008 admin gate (non-admin → 403 + service never
      invoked); returns {profiles, profile_count, generated_at}.
    - AdminSCEPIntuneHandler.ReloadTrust handles POST
      /api/v1/admin/scep/intune/reload-trust. Body is {path_id: '<id>'};
      empty body targets the legacy /scep root profile. Returns 200 on
      success / 404 on unknown PathID / 409 when the profile is Intune-
      disabled / 500 on a parse error from intune.LoadTrustAnchor (the
      holder retains its previous pool — fail-safe). 400 on malformed
      JSON.
    - ErrAdminSCEPProfileNotFound typed error so the handler can
      distinguish 'wrong profile' from 'broken file'.

  * internal/api/router/router.go: HandlerRegistry gains
    AdminSCEPIntune; both routes registered as bearer-auth-required
    (the admin-gate is at the handler layer per the M-008 pattern).

  * cmd/server/main.go: declares scepServices map[string]*service.SCEPService
    BEFORE HandlerRegistry construction so the same map can be referenced
    from both the admin handler (constructed early) and the SCEP startup
    loop (which populates it later by reference). The per-profile loop
    now calls scepService.SetPathID(profile.PathID) and stores the service
    pointer into the shared map. AdminSCEPIntune handler is constructed
    at the same time as AdminCRLCache.

  * internal/api/handler/m008_admin_gate_test.go: AdminGatedHandlers
    map gains 'admin_scep_intune.go' with a one-line justification —
    the regression scanner enforces the per-handler test triplet
    (TestAdminSCEPIntune_NonAdmin_Returns403 + _AdminExplicitFalse_Returns403
    + _AdminPermitted_ForwardsActor) plus their POST siblings for
    ReloadTrust.

  * api/openapi.yaml: documents both endpoints with request body /
    response shape / error mapping; openapi-parity-test now matches
    the registered routes.

Frontend (Phase 9.4):

  * web/src/pages/SCEPAdminPage.tsx — single-page Intune Monitoring
    surface:
    - Per-profile cards (one card per SCEP profile). Enabled profiles
      get the full counter grid + trust-anchor-expiry badge tone
      (good ≥30d / warn 7-30d / bad <7d / EXPIRED). Disabled profiles
      get an off-state pill with the env-var hint to opt in.
    - Counters polled every 30s via TanStack Query against
      GET /admin/scep/intune/stats.
    - Recent failures table (last 50) populated from the audit log
      filtered to action=scep_pkcsreq_intune AND scep_renewalreq_intune;
      merged + sorted by timestamp descending. Polled every 60s.
    - Reload trust anchor button per profile + confirmation modal that
      explains the SIGHUP equivalence and the fail-safe behavior.
      onConfirm runs a TanStack mutation, refetches the stats query
      on success, surfaces the underlying error (eg 'trust anchor
      cert expired') in the modal on failure (modal stays open so
      operator can retry).
    - Admin gate: when authRequired && !admin the page renders an
      'Admin access required' banner and the underlying admin API
      requests are never issued (React Query enabled flag gated on
      auth.admin) — server-side enforcement is M-008.

  * web/src/api/types.ts: IntuneStatsSnapshot + IntuneTrustAnchorInfo +
    IntuneStatsResponse + IntuneReloadTrustResponse.

  * web/src/api/client.ts: getAdminSCEPIntuneStats +
    reloadAdminSCEPIntuneTrust(pathID).

  * web/src/main.tsx: new route /scep/intune. The route is unconditional;
    the gating is at the page level so deep-links land cleanly.

  * web/src/components/Layout.tsx: 'SCEP Intune' nav link between
    Observability and Audit Trail with the appropriate sidebar icon.

Tests (Phase 9.5):

  * internal/api/handler/admin_scep_intune_test.go (16 tests):
    - M-008 admin-gate triplet for both Stats (GET) and ReloadTrust
      (POST): NonAdmin / AdminExplicitFalse / AdminPermitted.
    - Method-gate tests (Stats rejects POST, ReloadTrust rejects GET).
    - Stats propagates service errors as 500.
    - ReloadTrust maps ErrAdminSCEPProfileNotFound→404,
      ErrSCEPProfileIntuneDisabled→409, generic err→500.
    - Empty body targets legacy root PathID.
    - Malformed JSON→400.
    - AdminSCEPIntuneServiceImpl handles nil map + unknown PathID.

  * web/src/pages/SCEPAdminPage.test.tsx (13 tests):
    - Admin gate (non-admin sees gated banner + zero admin API calls;
      admin sees the page; no-auth dev mode also passes).
    - Profile rendering (counters with correct labels, expiry badge
      tone for ≥30d / EXPIRED states, off-state pill for disabled
      profiles, empty-state banner when no profiles configured).
    - Reload modal (opens on click, calls mutation on Confirm,
      keeps modal open + shows error on failure, Cancel skips
      mutation).
    - Error path renders ErrorState with retry.
    - Audit log filter merges PKCSReq + RenewalReq events and sorts
      descending.

Verification:

  * gofmt clean on touched files
  * go vet ./... clean
  * staticcheck on intune/service/api/cmd-server clean
  * go test -short across api+service+intune+cmd-server: all green
  * web tsc --noEmit clean
  * Vitest: SCEPAdminPage.test.tsx 13/13 + sibling page suites all
    pass
  * G-3 docs-drift CI guard: Phase 9 adds no new CERTCTL_* env vars
    so the guard does not fire
  * openapi-parity-test green (both new admin endpoints documented)
  * M-008 regression scanner enforces the per-handler test triplet —
    pin updated, all triplets present

Refs: cowork/scep-rfc8894-intune-master-prompt.md::Phase 9
      cowork/scep-rfc8894-intune/progress.md
2026-04-29 16:14:07 +00:00
shankar0123 a12a437664 feat(scep): mTLS sibling route /scep-mtls/<pathID> (opt-in)
SCEP RFC 8894 + Intune master bundle — Phase 6.5 of 14 (opt-in,
enterprise-procurement-checkbox).

Closes the procurement-team objection that 'shared password
authentication' is a checkbox-fail regardless of how strong the
password is. The clean answer: a sibling route that adds client-cert
auth at the handler layer AND keeps the challenge password (defense in
depth, not replacement). Devices present a bootstrap cert from a
trusted CA (e.g. a manufacturing-time cert), then SCEP-enroll for
their long-lived cert. Same model Apple's MDM and Cisco's BRSKI use.

internal/config/config.go
  * SCEPProfileConfig gains MTLSEnabled bool + MTLSClientCATrustBundlePath
    string. Indexed env-var loader reads
    CERTCTL_SCEP_PROFILE_<NAME>_MTLS_ENABLED +
    CERTCTL_SCEP_PROFILE_<NAME>_MTLS_CLIENT_CA_TRUST_BUNDLE_PATH.
  * Validate() refuses MTLSEnabled=true with empty bundle path —
    structural defense in depth ahead of the file-content preflight.

cmd/server/main.go
  * preflightSCEPMTLSTrustBundle: file existence + PEM parse + ≥1
    CERTIFICATE block + non-expired check. Returns the parsed
    *x509.CertPool ready to inject into the per-profile SCEPHandler.
    Failures os.Exit(1) with the offending PathID in the structured log.
  * SCEP startup loop walks each profile; when MTLSEnabled, runs
    preflight, builds the per-profile pool, contributes the bundle's
    certs to the union pool that backs the TLS-layer
    VerifyClientCertIfGiven, clones the SCEPHandler with
    SetMTLSTrustPool, and registers the parallel sibling route via
    apiRouter.RegisterSCEPMTLSHandlers.
  * Union pool published to outer scope as scepMTLSUnionPoolForTLS;
    passed to buildServerTLSConfigWithMTLS so the listener serves both
    /scep[/<pathID>] (no client cert) and /scep-mtls/<pathID>
    (cert required at handler layer) on the same socket.
  * Final-handler dispatch gains /scep-mtls + /scep-mtls/* prefix
    routing through the no-auth chain (auth boundary is the client
    cert + challenge password, NOT a Bearer token).

cmd/server/tls.go
  * New buildServerTLSConfigWithMTLS that wraps buildServerTLSConfig
    + sets ClientCAs + ClientAuth=VerifyClientCertIfGiven when a
    non-nil pool is passed. nil pool = identical TLS shape to the
    pre-Phase-6.5 builder (no behavior change for deploys without
    mTLS profiles).
  * Critical: VerifyClientCertIfGiven (NOT RequireAndVerifyClientCert)
    so a client that doesn't present a cert can still hit the standard
    /scep route. The per-profile gate at the handler layer enforces
    'cert required' on /scep-mtls/<pathID>.

internal/api/handler/scep.go
  * SCEPHandler gains mtlsTrustPool *x509.CertPool field +
    SetMTLSTrustPool method. Per-profile pool injected by
    cmd/server/main.go after preflight.
  * HandleSCEPMTLS wrapper: gates on r.TLS.PeerCertificates non-empty
    + per-profile cert.Verify against THIS profile's pool. Returns
    HTTP 401 for missing/untrusted cert (mTLS failure is auth, not
    authorization). Returns HTTP 500 if mtlsTrustPool is nil (deploy
    bug — the route shouldn't have been registered). On success
    delegates to HandleSCEP — defense in depth: mTLS is additive,
    NOT replacement; the standard SCEP code path including the
    challenge-password gate still executes.
  * Per-profile re-verification via cert.Verify(...) is critical:
    the TLS layer verified against the UNION pool, so a cert that
    chains to profile A's bundle would pass TLS even when targeting
    profile B. The handler-layer gate prevents cross-profile
    bleed-through.

internal/api/router/router.go
  * AuthExemptDispatchPrefixes gains '/scep-mtls' (auth boundary is
    client cert + challenge password, NOT Bearer token).
  * RegisterSCEPMTLSHandlers parallel to RegisterSCEPHandlers:
    empty PathID maps to /scep-mtls root; non-empty maps to
    /scep-mtls/<pathID>. Each handler in the map MUST have had
    SetMTLSTrustPool called.

internal/api/router/openapi_parity_test.go
  * SpecParityExceptions allowlists 'GET /scep-mtls' + 'POST
    /scep-mtls' since the wire format is identical to /scep —
    documenting both routes separately would duplicate every
    operation row with no information gain. Documented alternative
    in docs/legacy-est-scep.md.

internal/api/handler/scep_mtls_test.go (new, ~210 LoC)
  * 6 tests + 2 helpers covering the auth contract:
    1. RejectsMissingClientCert — request with r.TLS=nil → 401
    2. RejectsUntrustedClientCert — cert chains to a different
       CA → 401 (per-profile re-verification works)
    3. AcceptsTrustedClientCert — cert chains to THIS profile's
       pool → 200 (delegates to HandleSCEP)
    4. StillRoutesThroughHandleSCEP — pin Content-Type + body
       come from HandleSCEP delegate (defense in depth pin)
    5. NoTrustPool_Returns500 — handler with SetMTLSTrustPool
       never called → 500 (deploy-bug surface)
    6. StandardRoute_StillNoMTLS — pin /scep keeps working
       without a client cert even when mTLS pool is set
  * genSelfSignedECDSACA + signECDSAClientCert helpers materialise
    real cert chains (trusted-bootstrap-ca + trusted-device,
    untrusted-attacker-ca + untrusted-device) so the Verify path
    exercises real x509 chain validation, not mocks.

docs/features.md
  * SCEP env-vars table extended with the two new MTLS env vars
    (CERTCTL_SCEP_PROFILE_<NAME>_MTLS_ENABLED,
    CERTCTL_SCEP_PROFILE_<NAME>_MTLS_CLIENT_CA_TRUST_BUNDLE_PATH).
    Closes the G-3 'env var defined in Go but never documented' gate.

docs/legacy-est-scep.md
  * New 'mTLS sibling route (Phase 6.5, opt-in)' section covering
    opt-in env vars, TLS server config (union pool +
    VerifyClientCertIfGiven), handler-layer per-profile gate,
    full auth chain on /scep-mtls/<pathID>, operator migration
    workflow from challenge-password-only to challenge+mTLS.

cowork/CLAUDE.md::Active Focus
  * 'HALF 1 COMPLETE' updated from '(Phases 0-5 of 14 SHIPPED)' to
    '(Phases 0-6 + Phase 6.5 of 14 SHIPPED)'.

Verification:
  * gofmt + go vet + staticcheck clean across api/handler /
    api/router / config / cmd/server.
  * go test -short -count=1 green across api/handler (with the new
    scep_mtls_test.go) / api/router / service / config / pkcs7 /
    cmd/server / connector/issuer/local.
  * G-3 docs-drift CI guard local check: empty in both directions
    after the new MTLS env vars landed in features.md.
  * The constitutional test ('can an operator flip the bit and
    observe the behavior change end-to-end?') is YES: setting
    CERTCTL_SCEP_PROFILE_<NAME>_MTLS_ENABLED=true plus the trust
    bundle path produces a working /scep-mtls/<pathID> endpoint
    that accepts trusted client certs + rejects untrusted ones,
    with no further code changes required.

Phase 6.5 of 14 in SCEP RFC 8894 + Intune master bundle.
Half 1 (Phases 0-6 + 6.5) is now FEATURE-COMPLETE for the
ChromeOS / general-MDM use case. Half 2 (Phases 7-12) adds the
Microsoft Intune dynamic-challenge layer.
2026-04-29 13:58:18 +00:00