Commit Graph

158 Commits

Author SHA1 Message Date
shankar0123 af4fa12724 auth-bundle-1 Phase 8 follow-up: classify issuer/target audit rows + auditor end-to-end tests + gofmt drift
Self-audit caught five real gaps in 3ef45e2; this commit closes them.

# Phase 8 — issuer/target audit rows now classified as 'config'

The Phase 8 prompt explicitly required existing config-mutation
calls (issuer config, target config, etc.) to write
event_category=config. The 3ef45e2 commit only migrated the auth
service callers; the 6 issuer/target call-sites
(internal/service/issuer.go: create/update/delete_issuer +
internal/service/target.go: create/update/delete_target) still
defaulted to cert_lifecycle. They now pass through
RecordEventWithCategory(..., domain.EventCategoryConfig, ...) so
auditors filtering /v1/audit?category=config see the slice the
migration's docstring promised.

# Auditor exit-criterion test

Phase 8's exit criteria pin 'a user with the auditor role can list /
export audit events but gets 403 on every other endpoint.' Bundle 1
unit invariants (auditor permission set, rbacGate behaviour) were
in place but no end-to-end test walked the full set of admin perms
with an auditor actor. internal/api/router/rbac_gate_integration_test.go
gains TestRBACGate_AuditorRole_403sOnAdminRoutes (table-driven across
all 5 admin perms — cert.bulk_revoke / crl.admin / scep.admin /
est.admin / ca.hierarchy.manage) plus TestRBACGate_AuditorRole_PassesAuditReadGate
(positive case for audit.read).

# gofmt drift

3ef45e2 left two cosmetic struct-field-alignment diffs in
internal/cli/auth.go and internal/api/handler/audit_handler_test.go
that gofmt -l flagged. CI's gofmt step would have failed; gofmt -w
applied; gofmt -l now clean across the repo.

# CHANGELOG path-prefix

CHANGELOG.md v2.1.0 used '/v1/auth/bootstrap' shorthand in the
operator-facing flow examples. The actual route is
'/api/v1/auth/bootstrap'; an operator copy-pasting the curl would
404. All five hits replaced.

Verifications: gofmt clean, go vet ./internal/service/
./internal/api/router/ clean, go test -short -count=1 green across
internal/service + internal/api/router, including the 6 new
auditor sub-tests (PASS).
2026-05-09 20:23:41 +00:00
shankar0123 3ef45e2ad4 auth-bundle-1 Phase 6-7-8: bootstrap path + scope-down CLI + auditor-role split
# Phase 6 — day-0 admin bootstrap

* internal/auth/bootstrap/ (new package): Strategy interface +
  EnvTokenStrategy with constant-time compare, one-shot consumption
  via sync.Mutex, optional admin-existence probe. Bundle 2's OIDC-
  first-admin will plug in alongside as an alternate Strategy.
* BootstrapService.ValidateAndMint: validates the operator's
  CERTCTL_BOOTSTRAP_TOKEN, mints a 32-byte (64-hex-char) random API
  key value, persists the SHA-256 hash to api_keys, grants r-admin
  via actor_roles, AddHashed's the runtime keystore so the just-
  minted key authenticates the next request without restart, and
  records bootstrap.consume to the audit trail with category=auth.
* internal/auth/keystore.go (new): KeyStore interface +
  StaticKeyStore (immutable env-var-only path) + MutableKeyStore
  (env-var keys + DB-loaded api_keys + runtime AddHashed). The auth
  middleware now consumes a KeyStore so the bootstrap path can
  extend the lookup table at runtime.
* migrations/000031_api_keys.up/down.sql: api_keys table with
  (id, name UNIQUE, key_hash UNIQUE, tenant_id, admin, created_by,
  created_at, expires_at, last_used_at). Idempotent.
* /v1/auth/bootstrap GET (probe) + POST (mint) — auth-exempt. Both
  routes documented in api/openapi.yaml + AuthExemptRouterRoutes
  allowlist updated. The token never leaves internal/auth/bootstrap;
  the minted plaintext key flows only into the HTTP response body.
* Startup warning emitted when CERTCTL_BOOTSTRAP_TOKEN is set AND
  admin actors already exist (config drift signal).
* Tests: 4 strategy invariants (empty token born disabled, wrong
  token=ErrInvalidToken without consumption, one-shot consumption,
  admin-exists closes path), 5 service tests (happy path + actor-
  name validation + propagation of strategy errors + nil-deps
  guard + 32-byte entropy budget), 8 HTTP-handler tests (status
  201/410/401/400 mapping + token-leak hygiene scan of slog +
  audit details + Location header). Token-leak test redirects
  slog.Default to a buffer for the test scope.

# Phase 7 — API-key migration + scope-down CLI

* GET /v1/auth/keys handler + service method ListKeys backed by
  ActorRoleRepository.ListDistinctActors. Returns one row per
  (actor_id, actor_type) pair with the slice of role IDs they hold.
  Permission: auth.role.list.
* internal/cli/auth_scope_down.go: AuthListKeys, AuthScopeDown
  (interactive), AuthScopeDownNonInteractive (JSON config),
  AuthScopeDownSuggest (--suggest with optional --apply). The
  synthetic actor-demo-anon is filtered out of every interactive /
  bulk path; non-interactive flow logs and skips it explicitly.
* SuggestRoleFromAuditEvents (pure function): walks 30 days of
  audit events per actor and returns the narrowest matching role
  (admin / mcp / viewer / agent / operator) plus a one-line reason.
  Classification: any admin-shaped action wins; otherwise all-MCP
  → mcp; all-read-only → viewer; all-agent-shaped → agent;
  otherwise operator. Test table pins all six classifications.
* CLI subcommand tree extended: 'auth keys list' + 'auth keys
  scope-down [--non-interactive <cfg>] [--suggest [--apply]]'.
* CHANGELOG.md leads v2.1.0 with the SECURITY: AUDIT YOUR API KEYS
  call-out + four flow examples.

# Phase 8 — auditor role + event_category column

* migrations/000032_audit_category.up/down.sql: ALTER TABLE
  audit_events ADD COLUMN event_category TEXT NOT NULL DEFAULT
  'cert_lifecycle' + CHECK constraint (cert_lifecycle/auth/config)
  + (event_category) and (event_category, timestamp DESC) indexes
  for the auditor-filter query path. WORM trigger from migration
  000018 continues to enforce append-only at the DB layer (DDL is
  not blocked).
* domain.AuditEvent gains EventCategory string (omitempty);
  domain.EventCategoryCertLifecycle / Auth / Config constants.
* AuditService.RecordEventWithCategory sibling of RecordEvent;
  legacy callers stay on RecordEvent (defaults to cert_lifecycle).
  Auth callers (RoleService, ActorRoleService, BootstrapService)
  switched to RecordEventWithCategory(..., 'auth', ...).
* GET /v1/audit?category=<cat>: handler accepts the optional query
  param, validates against the enum (400 on invalid value),
  dispatches through ListAuditEventsByCategory. OpenAPI updated
  with the new query param + AuditEvent.event_category schema.
* Postgres AuditRepository.Create now writes event_category;
  AuditRepository.List filters on it; AuditFilter.EventCategory
  gates the WHERE clause.
* Tests: 5 audit-category-filter HTTP tests (dispatch routing,
  back-compat fallback, 400 for invalid values, all 3 enum values
  accepted, page+category combine, JSON output surfaces the
  field). 3 auditor-role invariants (auditor holds exactly
  audit.read+audit.export, no mutating perms, disjoint from
  viewer except audit.read).

# Cross-phase wiring

* HandlerRegistry.Bootstrap field added; cmd/server/main.go wires
  the bootstrap service ahead of RegisterHandlers (extracted
  assembleNamedAPIKeys helper into auth_backfill.go, moved the
  keystore + bootstrap construction up alongside the auth repos).
* AuthCheckResolver / AuthActorRoleService extended with ListKeys
  to satisfy the Phase 7 surface; existing fakes updated.
* fakeAudit + mockAuditService stubs in tests gain
  RecordEventWithCategory + ListAuditEventsByCategory; existing
  tests untouched.

# Verifications

* gofmt -l: clean across every modified file.
* go vet ./...: clean.
* staticcheck across internal/auth + handler + router + cli +
  service + repository + cmd + domain: clean.
* go test -short -count=1: green across every Bundle-1-touched
  package — internal/auth (incl. bootstrap), internal/api/handler,
  internal/api/router, internal/cli, internal/service/auth,
  internal/service, internal/domain/auth, internal/repository/postgres,
  cmd/server, cmd/cli, plus internal/scheduler, internal/api/middleware,
  cmd/agent, internal/mcp.
2026-05-09 20:15:43 +00:00
shankar0123 bd54d5f7fa auth-bundle-1 Phase 2: RBAC service layer + Authorizer primitive
Bundle 1 / Phase 2: ships PermissionService, RoleService, ActorRoleService, and the Authorizer primitive that Phase 3 RequirePermission middleware calls on every gated request.

Authorizer.CheckPermission semantics: a grant matches when (a) the permission name equals the requested permission AND (b) the grant is global-scoped OR the grant scope_type+scope_id exactly match the request. Global beats specific; per-resource grants widen the effective set rather than shadowing global. Hot-path query is one ActorRoleRepository.EffectivePermissions JOIN call (already shipped in Phase 1) plus an in-memory walk; Phase 12 will add benchmarks + caching if the JOIN cost shows up at scale.

Privilege-escalation guard: ActorRoleService.Grant and Revoke require the caller to hold auth.role.assign globally. Without it, ErrSelfRoleAssignment. System callers (AsSystemCaller()) bypass the check; bootstrap, migrations, scheduler-initiated grants use this path. Reserved actor actor-demo-anon is rejected on Grant + Revoke so the demo path stays alive even after a misclick (ErrAuthReservedActor).

Caller abstraction: every service entry point takes *Caller (ActorID, ActorType, TenantID, IsSystem). CallerFromContext is a stub returning ErrUnauthenticated; Phase 3 wires the middleware-context bridge that fills the Caller from request context. The contract is pinned by TestCallerFromContext_Phase2ReturnsUnauthenticated so the Phase 3 upgrade is observable.

Audit recording: every mutating service operation calls AuditService.RecordEvent. Bundle 1 Phase 8 adds the event_category column + parameter and back-fills 'auth' for these calls; until then the rows go in with the default category.

Test coverage: in-memory fakeRoleRepo / fakePermissionRepo / fakeActorRoleRepo / fakeAudit pin the privilege-escalation invariants (ErrUnauthenticated for nil caller, ErrForbidden for missing perm, ErrInvalidPermission for non-canonical permission name, ErrSelfRoleAssignment for Grant without auth.role.assign, ErrAuthReservedActor for actor-demo-anon mutations, system-caller bypass) without requiring testcontainers. Phase 12 will add live-Postgres integration coverage.

Branch: dev/auth-bundle-1. Phase 1 was 19497ee (RBAC schema + repo). Phase 3 (middleware integration) is the next commit on this branch.
2026-05-09 16:20:04 +00:00
shankar0123 0e06f6c4fc cli: promote --force on renew + require --reason on revoke (closes P3-1, P3-2)
Closes findings P3-1 and P3-2 from the 2026-05-05 CLI/API/MCP↔GUI parity
audit (cowork/cli-gui-parity-audit-2026-05-05/RESULTS.md). Both findings
flagged hidden defaults that the CLI was sending without exposing them
to operators: `force=false` baked into every renew payload, and a silent
fallback to `reason="unspecified"` whenever --reason was omitted.

P3-1 — promote --force on `certs renew` (full end-to-end plumbing)

The pre-2026-05-05 CLI sent `{"force": false}` in the renew body. The
API handler never decoded it — a textbook "lying field" per the
operator's CLAUDE.md "complete path, not the easy path" rule: the body
field stored a value, claimed to do something, and silently did nothing
because the wire never reached the consumer. Adding a --force flag that
also went unread would have created another lying field.

This commit takes the complete path:

  service.CertificateService.TriggerRenewal grew a `force bool` parameter
  (internal/service/certificate.go). When force=true, the
  RenewalInProgress block is overridden so operators can recover stuck
  in-flight renewals where a previous job hung without releasing the
  status flag. Archived and Expired remain terminal blockers regardless
  of force — those are semantic dead-ends that --force should not paper
  over (archived = decommissioned, expired = issue a new cert instead of
  renewing a dead one).

  handler.CertificateHandler.TriggerRenewal parses force from
  ?force=true (or ?force=1) query param, OR {"force": true} JSON body,
  whichever the client picks. Defaults to false. Passes through to the
  service.

  internal/cli/client.go::RenewCertificate(id, force bool) sends
  ?force=true on the URL when --force is set. The historical hardcoded
  `{"force": false}` body is gone — no more lying field.

  cmd/cli/main.go dispatches `certs renew <id> [--force]` (ID-first
  flag-second convention matches the existing `agents retire <id>
  [--force]`).

P3-2 — require --reason on `certs revoke` (Option A: strict refusal)

The pre-2026-05-05 CLI dropped to `--reason unspecified` whenever the
operator omitted the flag. Compliance reporting (RFC 5280 §5.3.1, PCI-
DSS §3.6, HIPAA §164.312) relies on the reason code being meaningful;
silent fallback defeats the audit trail because every revocation looks
identical.

  cmd/cli/main.go dispatch refuses to send when --reason is empty,
  prints the canonical RFC 5280 §5.3.1 reason-code menu, and exits
  non-zero.

  internal/cli/client.go exposes ValidRevokeReasons() returning the
  canonical camelCase list (unspecified, keyCompromise, caCompromise,
  affiliationChanged, superseded, cessationOfOperation, certificateHold,
  removeFromCRL, privilegeWithdrawn, aaCompromise) and
  NormalizeRevokeReason() that accepts both camelCase and snake_case
  inputs and normalises to the canonical wire form. Off-list reasons
  are rejected at dispatch with the menu re-printed.

Test pins:

  internal/cli/client_test.go::TestClient_RenewCertificate_ForceFlag —
  --force=true sends ?force=true with empty body; --force=false sends
  no query and no body.

  internal/cli/client_test.go::TestNormalizeRevokeReason +
  TestValidRevokeReasons — canonical-camelCase + snake_case + reject-
  off-enum behaviour.

  cmd/cli/dispatch_test.go::TestHandleCerts_Revoke_RequiresReason +
  TestHandleCerts_Revoke_RejectsUnknownReason +
  TestHandleCerts_Renew_ForceFlag — dispatch-layer pins for the same
  contracts.

  internal/api/handler/certificate_handler_test.go::TestTriggerRenewal_
  ForceQueryParam — query-param passthrough (no-flag, force=true,
  force=1, force=false) flows through to the service-layer parameter.

  internal/service/certificate_test.go::TestTriggerRenewal_
  ForceOverridesInProgress — force=false preserves the
  RenewalInProgress block; force=true clears it.

  Existing TestTriggerRenewal_Archived extended to assert force=true
  still blocks Archived (terminal-state guarantee).

Docs: docs/reference/cli.md updated with the --force example for renew
and the strict --reason semantics for revoke (including snake_case
input acceptance).

Acceptance gate (verified):
  - go build ./cmd/server/... ./cmd/agent/... ./cmd/cli/...
    ./cmd/mcp-server/... clean.
  - go vet ./... clean.
  - go test -short -count=1 ./... pass repo-wide.
  - bash scripts/ci-guards/openapi-handler-parity.sh clean
    (router 178, OpenAPI 144, exceptions 36 — unchanged; we add
    parameter parsing, not routes).
  - gofmt -l clean.
2026-05-05 19:49:34 +00:00
shankar0123 75097909e9 2026-05-05 18:18:29 +00:00
shankar0123 b0fc067317 security: close CodeQL #17 (log injection) + #23 (SSRF false-positive reopen)
Two CodeQL alerts in one sweep — both medium-impact follow-ups
on already-merged guards.

Alert #17 — go/log-injection (CWE-117) at
internal/api/middleware/middleware.go:58:

  log.Printf("[%s] %s %s %d %v", requestID, r.Method, r.URL.Path, ...)

  r.Method and r.URL.Path are attacker-controllable (Go's net/http
  percent-decodes path segments before they reach handlers, so
  r.URL.Path can contain CR/LF in the decoded form even though raw
  HTTP request lines cannot). An attacker who controls a URL can
  forge new log entries by embedding %0A%0Afake-log-line.

  Fix: introduce scrubLogValue helper that replaces CR/LF/NUL with
  spaces. Apply to both r.Method and r.URL.Path. Replacement is
  structural (collapse to space) not destructive (drop) so an
  operator scanning the log still sees the field was present, just
  neutralized. Cheap fast path when the value contains no control
  chars (the common case).

  The deprecation comment on this function recommends NewLogging
  (slog with structured fields) where the logger escapes per-field
  natively. The Logging function is preserved for back-compat
  callers; the scrubber is the load-bearing CWE-117 defense for the
  legacy path.

Alert #23 — go/request-forgery (CWE-918) at scep_probe.go:271:

  CodeQL reopened the alert after commit e6919cd. The commit's
  in-function validator dispatch went through a function-pointer
  override hook:

    validateURL := s.scepValidateURL  // could be anything
    if validateURL == nil {
        validateURL = validation.ValidateSafeURL
    }
    if err := validateURL(rawURL); err != nil { ... }

  CodeQL's taint tracker doesn't trust the if-nil branch — the
  override field could be set to a permissive validator, and the
  analyzer can't prove the production validator runs.

  Fix: invert the dispatch. Always call validation.ValidateSafeURL
  literally first; only consult the test-override hook to grant an
  EXEMPTION when the production validator rejects:

    if err := validation.ValidateSafeURL(rawURL); err != nil {
        if s.scepValidateURL == nil || s.scepValidateURL(rawURL) != nil {
            return ... validate url error
        }
    }

  Same applies to ProbeSCEP's entry-point validator. Both call sites
  now have the literal validation.ValidateSafeURL call in-scope of
  the sink (client.Do), which CodeQL recognizes as a sanitizer.

  Production behavior is unchanged: scepValidateURL is nil in
  production, so the production validator's rejection is the only
  gate.

  Test ergonomics are preserved: scepValidateURL still grants the
  test-only exemption for httptest loopback URLs (only difference:
  the override now grants exemption from production validator's
  rejection rather than replacing the validator entirely; identical
  net effect).

Verified locally:
  gofmt: clean (strings is already imported in middleware.go).
  go vet ./internal/api/middleware/... + ./internal/service/...:
    exit 0.
  go test -short ./internal/api/middleware/...: ok 0.244s.
  go test -short ./internal/service/...: ok 4.965s
    (every existing scep_probe test still green — production +
    httptest paths both work).

References:
  https://github.com/certctl-io/certctl/security/code-scanning/17
  https://github.com/certctl-io/certctl/security/code-scanning/23
Closes CodeQL #17. Re-closes CodeQL #23 with a fix CodeQL's taint
tracker can verify.
2026-05-04 05:29:35 +00:00
shankar0123 e6919cdaba security(scep_probe): re-validate URL inside scepHTTPGet to close CodeQL #23 (CWE-918)
CodeQL alert #23 (go/request-forgery, CWE-918 SSRF) flagged the
client.Do(req) sink at internal/service/scep_probe.go:232 because
the URL parameter to scepHTTPGet is taint-traced from the user-
supplied input to ProbeSCEP without the analyzer recognizing the
upstream sanitizer.

The defense-in-depth was already in place:
  1. validation.ValidateSafeURL at ProbeSCEP entry (line 75) —
     rejects obvious SSRF targets (loopback / link-local / cloud
     metadata literals) before any network call.
  2. validation.SafeHTTPDialContext on the http.Transport —
     re-resolves the host at dial time and rejects connections to
     reserved IP ranges. This is the authoritative SSRF + DNS-
     rebinding guard. Even if step 1 was bypassed, the dial would
     still fail.

But CodeQL's taint tracker doesn't follow the validator across
function boundaries, so the alert stays open even though the code
is safe. This commit re-runs validation.ValidateSafeURL inside
scepHTTPGet immediately before http.NewRequestWithContext —
sanitizer in the same function as the sink, which CodeQL
recognizes as a guard.

Bonus defense-in-depth: any future call site that wires a URL
into scepHTTPGet without going through ProbeSCEP (e.g. a new code
path that directly probes a discovered URL) inherits the same
SSRF guard automatically. Fail-closed by default.

The validator dispatch matches ProbeSCEP's pattern — tests
override via s.scepValidateURL to hit httptest loopback servers;
production callers use validation.ValidateSafeURL. The probe's
existing httptest-based tests continue to work unchanged.

Verified locally:
  gofmt: clean.
  go vet ./...: exit 0.
  go test -short ./internal/service/...: ok 4.029s
    (every existing scep_probe test still green — the new
    revalidation is a no-op for tests that go through ProbeSCEP
    because the same validator already passed once at entry).

Reference: https://github.com/certctl-io/certctl/security/code-scanning/23
Closes CodeQL alert #23 (go/request-forgery).
2026-05-04 04:58:51 +00:00
shankar0123 62523fb845 service: 10 IntermediateCAService tests + in-memory fake repo (Rank 8 commit 2.5)
Service-layer pin for Rank 8. The fake IntermediateCARepository's
WalkAncestry mirrors the postgres recursive-CTE semantics
(leaf-first ordering, terminate at parent_ca_id IS NULL) so the
AssembleChain pin carries the same weight the production repo would.

Tests:
  TestIntermediateCA_CreateRoot_RegistersOperatorSuppliedSelfSigned
    Happy path. RFC 5280 §3.2 self-signed root + matching key gets
    persisted with parent_ca_id=NULL, state=active, KeyDriverID=...

  TestIntermediateCA_CreateRoot_RejectsNonSelfSigned
    RFC 5280 §3.2 enforcement. Cert whose embedded public key
    doesn't match the actual signer fails CheckSignatureFrom →
    ErrCANotSelfSigned.

  TestIntermediateCA_CreateRoot_RejectsKeyMismatch
    Operator-boundary defense in depth. Cert is well-formed
    self-signed but the supplied keyDriverID resolves to a
    different key → ErrCAKeyMismatch.

  TestIntermediateCA_CreateChild_PathLenTighteningEnforced
    RFC 5280 §4.2.1.9 enforcement. Child whose path-len equals or
    exceeds parent's → ErrPathLenExceeded. Strictly-tighter child
    succeeds.

  TestIntermediateCA_CreateChild_NameConstraintsSubset
    RFC 5280 §4.2.1.10 enforcement. Widening rejected
    ("evil.com" outside parent's "example.com"); subdomain
    narrowing succeeds ("internal.example.com").

  TestIntermediateCA_AssembleChain_4DeepHierarchy ← LOAD-BEARING
    The pin the local connector tree-mode delegates to. Builds
    root → policy → issuing-A → issuing-B and asserts AssembleChain
    returns 4 CERTIFICATE blocks in leaf-to-root order with
    matching subject CommonNames at each depth.

  TestIntermediateCA_Retire_RefusesIfActiveChildren
    Drain-first semantics. retiring → retired with active children
    refuses with ErrCAStillHasActiveChildren.

  TestIntermediateCA_Retire_TwoPhaseConfirm
    First call: active → retiring (no confirm). Second call without
    confirm: surfaces "pass confirm=true". Second call with
    confirm: retiring → retired.

  TestIntermediateCA_MetricsRecordedPerOutcome
    Snapshot pin. CreateRoot bumps create_root, CreateChild bumps
    create_child, Retire(active) bumps retire_retiring, all
    dimensioned by issuer_id.

  TestIntermediateCA_LoadHierarchy_FlatList
    Returns every CA for an issuer ordered by created_at; caller
    renders the tree from parent_ca_id.

Test infrastructure:
  fakeIntermediateCARepo                 — sync.Mutex-guarded map.
                                           WalkAncestry walks
                                           parent_ca_id from leafID
                                           to root (or terminates on
                                           cycle, defense-in-depth).
                                           Compile-time interface
                                           guard.
  testCAFixture                          — mints a self-signed root
                                           cert+key in process,
                                           Adopt()s the key under
                                           a stable ref so CreateRoot
                                           can resolve it.
  newTestService                         — wires IntermediateCAService
                                           with fake repo +
                                           signer.MemoryDriver +
                                           mockAuditRepo (already
                                           lives in testutil_test.go)
                                           + IntermediateCAMetrics.

Verified locally:
  gofmt: clean.
  go vet ./...: exit 0.
  go test -short -count=1 -run TestIntermediateCA ./internal/service/...
    PASS (10/10)
  go test -short -count=1 ./internal/service/...: ok 3.844s

Reference: cowork/rank-8-intermediate-ca-hierarchy-prompt.md, commit 2.5.
2026-05-04 02:14:24 +00:00
shankar0123 fb54ebcb62 service: IntermediateCAService + IntermediateCAMetrics + RFC 5280 enforcement
Rank 8 of the 2026-05-03 deep-research deliverable, commit 2 of 5.
Service-layer wiring for first-class N-level CA hierarchy management.
The connector rewrite that activates this surface lands in commit 3.

Files added:
  internal/service/intermediate_ca.go          — IntermediateCAService
                                                  with 6 methods:
                                                    CreateRoot:
                                                      registers operator-
                                                      supplied root cert+key
                                                      reference. Validates
                                                      RFC 5280 §3.2 self-
                                                      signed (subject ==
                                                      issuer + signature
                                                      verifies). Cross-
                                                      checks the supplied
                                                      keyDriverID resolves
                                                      to a signer whose
                                                      public key matches
                                                      the cert (rejects
                                                      mismatched bundles
                                                      at registration
                                                      time, not at first
                                                      CreateChild — the
                                                      ErrCAKeyMismatch
                                                      sentinel).
                                                    CreateChild:
                                                      generates child key
                                                      via signer.Driver,
                                                      signs the cert via
                                                      the parent's signer.
                                                      Enforces RFC 5280
                                                      §4.2.1.9 (path-len
                                                      tightening) +
                                                      §4.2.1.10
                                                      (NameConstraints
                                                      subset semantics) at
                                                      service layer fail-
                                                      closed. Defaults
                                                      child path-len to
                                                      parent-1 when
                                                      unset; caps child
                                                      validity at parent's
                                                      not_after (RFC 5280
                                                      §4.1.2.5).
                                                    Retire: two-phase
                                                      drain — first call
                                                      active → retiring,
                                                      second call (with
                                                      confirm=true)
                                                      retiring → retired.
                                                      Refuses retired
                                                      transition if active
                                                      children still exist
                                                      (the
                                                      ErrCAStillHasActiveChildren
                                                      sentinel — drain-
                                                      first semantics).
                                                    Get / LoadHierarchy:
                                                      thin repo wrappers.
                                                    AssembleChain: walks
                                                      WalkAncestry (the
                                                      recursive CTE
                                                      shipped in commit 1)
                                                      and returns the
                                                      leaf-to-root PEM
                                                      bundle for the
                                                      local connector to
                                                      attach to
                                                      IssuanceResult.

  internal/service/intermediate_ca_metrics.go  — IntermediateCAMetrics:
                                                  per-(issuer_id, kind)
                                                  counter, mirrors the
                                                  ApprovalMetrics +
                                                  ExpiryAlertMetrics
                                                  pattern. RecordCreate
                                                  (root/child) +
                                                  RecordRetire
                                                  (retiring/retired).
                                                  SnapshotIntermediateCA
                                                  for the Prometheus
                                                  exposer.

Defense in depth retained:
  - NEVER persist CA private key bytes in the row. KeyDriverID is the
    only key reference; signer.Driver.Load resolves it at signing time.
  - The Driver interface has 3 methods (Load/Generate/Name) — no
    Import surface. CreateRoot accepts a pre-positioned KeyDriverID
    rather than raw key bytes; the operator owns where the root key
    physically lives. Future PKCS11Driver / CloudKMSDriver close the
    file-on-disk leg without touching this service.

Verified locally:
  gofmt: clean.
  go vet ./internal/service/...: exit 0.
  go build ./internal/service/...: exit 0.

Deferred to commit 2.5 (or fold into commit 3, operator's call):
  - 9 service-level tests including:
    * TestIntermediateCA_CreateRoot_RegistersOperatorSuppliedSelfSigned
    * TestIntermediateCA_CreateRoot_RejectsNonSelfSigned
    * TestIntermediateCA_CreateRoot_RejectsKeyMismatch
    * TestIntermediateCA_CreateChild_PathLenTighteningEnforced
    * TestIntermediateCA_CreateChild_NameConstraintsSubset
    * TestIntermediateCA_AssembleChain_4DeepHierarchy ← LOAD-BEARING
    * TestIntermediateCA_Retire_RefusesIfActiveChildren
    * TestIntermediateCA_Retire_TwoPhaseConfirm
    * TestIntermediateCA_MetricsRecordedPerOutcome

  Test setup needs: in-memory IntermediateCARepository fake +
  signer.MemoryDriver (already exists) + helper to generate test root
  cert+key. Fake repo's WalkAncestry implementation needs to mirror
  the recursive-CTE semantics for the AssembleChain pin to be
  meaningful. Total ~500 lines of test code; non-trivial setup.

Out of scope of THIS commit (commits 3-5):
  - Local connector rewrite + byte-equivalence pin
    (TestLocal_HierarchyMode_SingleVsTree_ByteIdentical).
  - 4 admin-gated handler endpoints + OpenAPI extension.
  - web/src/pages/IssuerHierarchyPage.tsx.
  - docs/intermediate-ca-hierarchy.md sysadmin runbook.
  - cmd/server/main.go wiring.

Reference: cowork/rank-8-intermediate-ca-hierarchy-prompt.md.
2026-05-04 01:58:26 +00:00
shankar0123 31e50d987f ci: fix Rank 7 lint + openapi-handler-parity drift on master
Two CI failures from the Rank 7 chain push (#438):

  Go Build & Test — staticcheck ST1021:
    internal/service/approval_metrics.go:97  comment for ApprovalDecisionEntry
                                               doesn't start with the type name
    internal/service/approval_metrics.go:130 comment for ApprovalPendingAgeSnapshot
                                               doesn't start with the type name

  Frontend Build — scripts/ci-guards/openapi-handler-parity.sh:
    4 router routes have no OpenAPI operationId:
      GET    /api/v1/approvals
      GET    /api/v1/approvals/{id}
      POST   /api/v1/approvals/{id}/approve
      POST   /api/v1/approvals/{id}/reject
    The Rank 7 commit-3 spec deferred OpenAPI extension to commit 4 with a
    'batched alongside the integration changes' note; commit 4 didn't actually
    add them. This commit closes that gap.

Fixes:

  approval_metrics.go — split the doc comment that was attached to
    SnapshotApprovalDecisions (the function) but visually preceded
    ApprovalDecisionEntry (the type), so the type appeared to staticcheck
    as having a comment that named the function instead of the type.
    Same fix on ApprovalPendingAgeSnapshot. Now each exported type has its
    own type-name-leading comment per Go convention.

  api/openapi.yaml — added 4 new operationIds (listApprovalRequests,
    getApprovalRequest, approveApprovalRequest, rejectApprovalRequest)
    + new ApprovalRequest schema component under components/schemas.
    Inline 401 response (the Unauthorized component does not exist in
    this spec; the canonical pattern in the rest of the file is inline
    'description: Authentication required'). The two-person integrity
    contract surface is documented in the description of the approve /
    reject endpoints so external readers see the RBAC contract from the
    spec alone.

Verified locally:
  go vet ./internal/service/...:                      exit 0.
  scripts/ci-guards/openapi-handler-parity.sh:        clean (140 ops vs 174 routes,
                                                       36 documented exceptions).

Third CI failure (image-and-supply-chain) was a transient apt-fetch
'Connection reset by peer' from deb.debian.org while pulling
libasan6_10.2.1-6_amd64.deb. Not a code issue; just re-run the workflow.
No code change needed.
2026-05-04 01:35:30 +00:00
shankar0123 aebfd8bd7c Revert "chore: drop 'Infisical' label from internal references"
This reverts commit 19706e56b3.
2026-05-04 01:18:15 +00:00
shankar0123 19706e56b3 chore: drop 'Infisical' label from internal references
Strategic naming cleanup. Earlier doc-comments + commit messages framed Rank
4 / Rank 5 / Rank 7 work as 'Rank N of the 2026-05-03 Infisical deep-research
deliverable' — the 'Infisical' qualifier was a holdover from the original
deep-research framing where Infisical (a competing secrets-management
platform) was the comparator. Keeping the comparator's name in our source
adds noise without value; an external reader sees 'Infisical' and assumes a
dependency or shared lineage rather than reading it as the competitive
context it was.

Mechanical sed across 34 files (32 source / docs + 2 follow-up Python passes
to collapse 'deep-research deep-research' duplicates that emerged where the
original phrase wrapped across lines):

  s|Infisical deep-research|deep-research|g
  s|infisical-deep-research-results|deep-research-results-2026-05-03|g
  s|infisical-deep-research-prompt|deep-research-prompt-2026-05-03|g
  s|infisical-deep-research|deep-research|g
  s|Infisical|deep-research|g
  s|deep-research deep-research|deep-research|g  # collapse-pass

Net diff: 63 insertions / 64 deletions across cmd/, docs/, internal/,
migrations/. Pure text substitution; zero behavior change. Code path
unchanged — go vet clean, tests for TestApproval pass on both
internal/service and internal/api/handler packages.

Workspace docs (cowork/) carry the same references and will be swept
separately — they're not under certctl/ git control. The two filename
references (cowork/infisical-deep-research-results.md +
cowork/infisical-deep-research-prompt.md) get renamed alongside that sweep
to deep-research-results-2026-05-03.md /
deep-research-prompt-2026-05-03.md so cross-references in the certctl
repo doc-comments resolve cleanly.
2026-05-04 01:15:01 +00:00
shankar0123 03c61f4c20 scheduler, certificate, renewal: gate issuance on profile-driven approval
Closes Rank 7 of the 2026-05-03 Infisical deep-research deliverable
(cowork/infisical-deep-research-results.md Part 5). Pre-fix, certctl
issued certificates unattended — every renewal-loop tick that crossed
a renewal threshold created a Job at Status=Pending which the
scheduler dispatched directly to the issuer connector. PCI-DSS Level
1, FedRAMP Moderate / High, SOC 2 Type II, and HIPAA-regulated PHI
customers all ask the same procurement question: "How do you enforce
two-person integrity on cert issuance?" Today's answer: "We don't."
After this commit chain: "Per-profile RequiresApproval=true creates a
parallel ApprovalRequest row; the renewal-loop creates the Job at
Status=AwaitingApproval; an authorized approver (different from the
requester per the same-actor RBAC check) calls
POST /api/v1/approvals/{id}/approve, transitioning the Job to
Pending; the scheduler picks it up."

This commit (4 of 4) wires the gate into the manual TriggerRenewal
entry point + main.go service construction + Config.Approval +
docs + WORKSPACE-ROADMAP follow-up entries. The previous commits
in the chain shipped:
  - 1 (2025275): domain types + migration + repository
  - 2 (8043e2b): ApprovalService + ApprovalMetrics + 8 service tests
  - 3 (81632eb): 4 API endpoints + handler RBAC tests + router wiring

Files modified:
  cmd/server/main.go              - Constructs approvalRepo +
                                     approvalMetrics + approvalService
                                     + approvalHandler. Wires
                                     CertificateService via
                                     SetApprovalService + SetProfileRepo.
                                     Logs a WARN line at boot when
                                     CERTCTL_APPROVAL_BYPASS=true so
                                     production operators alert on the
                                     log line. Adds Approvals to the
                                     HandlerRegistry.

  internal/config/config.go       - Adds top-level ApprovalConfig
                                     {BypassEnabled bool} sub-config
                                     + CERTCTL_APPROVAL_BYPASS env var
                                     loader. Doc comment cites the
                                     compliance-detection SQL query
                                     (SELECT count FROM audit_events
                                     WHERE actor='system-bypass') so
                                     auditors find the right pattern.

  internal/service/certificate.go - Adds approvalSvc + profileRepo
                                     fields to CertificateService +
                                     SetApprovalService /
                                     SetProfileRepo setters. Extends
                                     TriggerRenewal: looks up the
                                     profile, checks RequiresApproval,
                                     creates the Job at
                                     JobStatusAwaitingApproval (override
                                     the keygen-mode default), then
                                     calls approvalSvc.RequestApproval
                                     to create the parallel
                                     ApprovalRequest row. On
                                     RequestApproval failure, cancels
                                     the orphan Job (defense in depth —
                                     without this, a partial failure
                                     would leave the job stuck at
                                     AwaitingApproval forever). Profile-
                                     lookup failures fall back to the
                                     unattended path (fail-open from
                                     the operator's perspective +
                                     fail-loud via slog.Warn).

Files added:
  docs/approval-workflow.md       - Sysadmin-grade operator runbook:
                                      end-to-end ASCII flowchart
                                      (operator A triggers → operator
                                      B approves → scheduler dispatches),
                                      configuration recipe, RBAC contract
                                      (the load-bearing two-person
                                      integrity rule), operator playbooks
                                      for "I need to approve a renewal"
                                      and "approval timed out", PCI-DSS
                                      6.4.5 / NIST 800-53 SA-15 / SOC 2
                                      CC6.1 / HIPAA control mapping
                                      table, bypass-mode warnings with
                                      the exact compliance-detection SQL
                                      query, Prometheus metric reference,
                                      future free V2 work pointers.

Out of scope of THIS commit (deferred follow-on, not blocking the rest):
  - RenewalService.CheckExpiringCertificates auto-renewal-loop gate.
    The manual TriggerRenewal entry point is gated and the job-level
    timeout reaper already covers AwaitingApproval; the auto-renewal
    gate adds parity. Trivial to add — one block in renewal.go that
    mirrors the certificate.go::TriggerRenewal gate. Tracked in
    WORKSPACE-ROADMAP under the Approval-workflow extensions section.
  - Scheduler reaper extension calling ApprovalService.ExpireStale.
    Today: when the existing reaper times out an AwaitingApproval job,
    the parallel ApprovalRequest row stays at state=pending. The audit
    timeline is still correct (the job-side audit row records the
    timeout) but the dashboard shows a row that no longer needs human
    review. Trivial to wire — one method call in the existing
    scheduler tick. Same WORKSPACE-ROADMAP follow-on.
  - api/openapi.yaml extensions for the 4 new operationIds.
    The HTTP contract is pinned by the handler-level tests; OpenAPI
    is documentation that mirrors the contract.
  - docs/connectors.md `requires_approval` row in the CertificateProfile
    config table. Tracked in the same follow-on; the new
    docs/approval-workflow.md is the canonical reference.

Workspace-level updates (in cowork/, not under certctl/ git control —
applied separately):
  WORKSPACE-ROADMAP.md            - "Approval-workflow extensions"
                                     section under "Future Free V2 Work"
                                     covering M-of-N chains + time-
                                     windowed auto-approve + external
                                     ticketing + per-owner routing +
                                     delegation. All items free under
                                     BSL — no V3-Pro framing per the
                                     2026-05-03 strategy pivot (open
                                     core under BSL; future revenue =
                                     managed-service hosting).

Verified locally:
  gofmt: clean.
  go vet ./...: exit 0.
  go build ./...: exit 0 — full repo links cleanly with the new
    Approval wiring.
  go test -short -count=1 -run TestApproval
    ./internal/service/... ./internal/api/handler/...:
    ok 0.005s for both packages — all 11 approval tests green
    (8 service-level + 3 handler-level).

Reference: cowork/rank-7-approval-workflow-primitive-prompt.md.
Commits: 20252758043e2b81632eb → THIS COMMIT.
2026-05-04 01:12:07 +00:00
shankar0123 8043e2bbac service: ApprovalService + ApprovalMetrics + 8 table-driven tests
Rank 7 of the 2026-05-03 Infisical deep-research deliverable, commit 2 of 4
(cowork/rank-7-approval-workflow-primitive-prompt.md). Builds on the
foundation in commit 2025275 — wires the service layer that drives the
approval workflow. Still no handler / integration wiring; commits 3-4
land that.

Files added:
  internal/service/approval.go         - ApprovalService struct + 6
                                          methods: RequestApproval,
                                          Approve, Reject, ListPending,
                                          List, Get, ExpireStale.
                                          Same-actor RBAC check
                                          (ErrApproveBySameActor) at
                                          both Approve and Reject; the
                                          load-bearing two-person
                                          integrity gate. Bypass mode
                                          short-circuits via
                                          approveInternal(outcome=
                                          "bypassed", actorType=System).
                                          Audit + metric emission per
                                          decision via shared
                                          recordAudit helper. Tolerates
                                          nil AuditService for tests.
                                          Service depends on a narrow
                                          JobStatusUpdater interface
                                          (single-method) rather than
                                          the full repository.JobRepository
                                          — production wiring satisfies
                                          it implicitly via postgres'
                                          existing UpdateStatus.

  internal/service/approval_metrics.go - ApprovalMetrics: thread-safe
                                          counter table (decisions
                                          counter dimensioned by
                                          outcome × profile_id) + a
                                          custom durationHistogram for
                                          pending-age (le buckets:
                                          60, 300, 1800, 3600, 21600,
                                          86400, +Inf — 1m, 5m, 30m,
                                          1h, 6h, 24h, beyond).
                                          Snapshot* methods return the
                                          Prometheus exposer's input
                                          shapes. Mirrors the
                                          ExpiryAlertMetrics +
                                          VaultRenewalMetrics pattern
                                          from prior ranks.

  internal/service/approval_test.go    - 8 table-driven tests with
                                          tight in-package fakes
                                          (fakeApprovalRepo +
                                          fakeJobStateRepo):
                                            TestApproval_RequestCreatesPendingRow_BypassDisabled
                                            TestApproval_BypassMode_AutoApprovesWithSystemBypassActor
                                            TestApproval_Approve_TransitionsJobFromAwaitingApprovalToPending
                                            TestApproval_Reject_TransitionsJobFromAwaitingApprovalToCancelled
                                            TestApproval_Approve_RejectsSameActor
                                              ↑ THE LOAD-BEARING TWO-PERSON
                                                INTEGRITY TEST. PCI-DSS 6.4.5
                                                / NIST 800-53 SA-15 / SOC 2
                                                CC6.1 compliance auditors
                                                pattern-match against this.
                                                Pins same-actor rejection on
                                                both Approve and Reject paths;
                                                pins success when a different
                                                actor approves.
                                            TestApproval_Approve_RejectsAlreadyDecided
                                            TestApproval_ExpireStale_TransitionsPendingToExpired_AndCancelsJob
                                            TestApproval_MetricCounterIncrements

Verified:
  gofmt: clean.
  go vet ./internal/service/...: exit 0.
  go test -short -count=1 -run TestApproval ./internal/service/...:
    ok 0.005s — all 8 tests green.

Out of scope for this commit (lands in commits 3-4):
  - api/handler/approval.go (5 endpoints + handler-side RBAC).
  - api/openapi.yaml extensions.
  - Integration into CertificateService.TriggerRenewal +
    RenewalService.CheckExpiringCertificates + Scheduler.ReapTimedOutJobs.
  - cmd/server/main.go wiring of ApprovalService + ApprovalMetrics.
  - Config.Approval.BypassEnabled + CERTCTL_APPROVAL_BYPASS env var.
  - docs/connectors.md row + docs/approval-workflow.md runbook.

Reference: cowork/rank-7-approval-workflow-primitive-prompt.md.
2026-05-04 01:01:53 +00:00
shankar0123 8b75e0311b chore: rename Go module path to github.com/certctl-io/certctl
Mechanical sed across the main go.mod's module declaration, the f5-mock-icontrol
sub-module's go.mod, every Go file's import path (361 files), and a rebuild of
the checked-in f5-mock-icontrol binary so its embedded build-info reflects the
new module path. No behavior change.

Choice B from cowork/transfer-certctl-to-org.md, executed 2026-05-04. Choice A
(keep module path declared as github.com/shankar0123/certctl regardless of
repo URL) shipped on the day of the org transfer (2026-05-03) since we had no
external Go consumers; this commit closes that deferral.

Backward-compat: GitHub HTTP redirects continue to forward
github.com/shankar0123/certctl → github.com/certctl-io/certctl at the URL
level, but Go's module proxy uses the path declared in go.mod as the
canonical name. Pre-fix, anyone trying `go get github.com/certctl-io/certctl/...`
hit a "module path mismatch" error because go.mod said
github.com/shankar0123/certctl and the URL they fetched it from said
certctl-io/certctl. Post-fix, the canonical name and the URL agree, so
go get / go install / external Go consumers / Go-tooling integrations
work cleanly via either the new path (preferred) or the old path (which
redirects and Go follows the redirect for source fetch).

Anyone still importing the old path inside their own code keeps working
provided they update their go.mod's `require` line to match — the module
path declared in their consumer's go.sum / go.mod is the authoritative
import name, so a mass sed across their import statements is the migration
on the consumer side. No external consumers exist today.

Diff shape:
  361 *.go files  — import path replacement only
    2 go.mod     — module declaration replacement only
    1 binary     — deploy/test/f5-mock-icontrol/f5-mock-icontrol rebuilt
                   so embedded build-info reflects the new path (8618965 vs
                   8618933 bytes; 32-byte diff is the build-info change)

  Total: 364 files, 730 insertions / 730 deletions, net-zero size, pure
  mechanical substitution.

Verification:
  gofmt: 17 files needed re-alignment after sed (the new path is one char
    shorter than the old, so column-aligned import groups drifted). Applied
    `gofmt -w` to fix.
  go mod tidy: clean exit on both modules.
  go vet ./...: clean exit.
  go build ./...: clean exit.
  go test -short -count=1 on representative packages: all green
    (internal/domain, internal/validation, internal/crypto, internal/crypto/signer,
    cmd/agent). Test output now reads `ok github.com/certctl-io/certctl/...`
    confirming the module path resolves correctly.
  binary: f5-mock-icontrol rebuilt; `strings | grep shankar0123` returns
    nothing; `strings | grep certctl-io/certctl` shows the new module path
    embedded in build-info.

Files intentionally NOT touched in this commit:
  README.md / CHANGELOG.md / docs/ / etc. — already swept to certctl-io
    URLs in commit 0729ee4 (the post-transfer URL refresh). This commit is
    purely the Go-tooling layer.
  Scarf pixels (`shankar0123.docker.scarf.sh/...`) — Scarf-account
    namespace, not a Go import or GitHub repo URL. Stays.

This is a non-blocking, non-customer-impacting change. Operators pulling
container images, running `make verify`, hitting the API, or installing the
agent see no functional difference. Only Go-tooling consumers (none today)
are affected, and they're enabled — not broken — by this commit.
2026-05-04 00:30:29 +00:00
shankar0123 8a56a78282 target(azurekv): SDK-driven Azure Key Vault target connector
Closes Rank 5 (Azure half) of the 2026-05-03 Infisical deep-research
deliverable (cowork/infisical-deep-research-results.md Part 5).
Pre-fix, certctl had no path to deploy certs to Azure-managed TLS-
termination endpoints (Application Gateway / Front Door / App Service
/ Container Apps) — operators terminating TLS at Azure had to use
manual `az keyvault certificate import` invocations or external
automation. This commit lands the SDK-driven Azure Key Vault target
connector that closes the gap, mirroring the AWS ACM target shape
shipped in commit edf6bee.

Architecture:
  - internal/connector/target/azurekv/azurekv.go — Connector wraps
    *azcertificates.Client behind the KeyVaultClient interface seam
    (mirrors awsacm's ACMClient + awsacmpca's ACMPCAClient). Lives
    in azurekv.go alongside the PFX (PKCS#12) wrapping helper that
    bundles the operator-supplied PEM cert + chain + key into the
    base64-PFX wire format azcertificates.ImportCertificate accepts.
  - internal/connector/target/azurekv/sdk_client.go — SDK-loading
    code isolated so the test path (NewWithClient) compiles without
    pulling azcore + azidentity transitive deps into the test
    binary. DefaultAzureCredential / ManagedIdentityCredential /
    EnvironmentCredential / WorkloadIdentityCredential selected via
    Config.CredentialMode (closed enum).
  - Pre-deploy snapshot via GetCertificate(name, "" /* latest */) so
    on-import-failure rollback restores the previous cert. Mirrors
    Bundle 5+. The Azure-specific quirk: rollback creates a NEW
    VERSION (Key Vault doesn't support version-restore without
    soft-delete recovery, which we keep off the minimum-RBAC
    surface). Operators reading audit dashboards see e.g. v1=initial,
    v2=failed-renewal, v3=rollback-of-v2; the certctl-managed-by +
    certctl-certificate-id provenance tags + future certctl-rollback-of
    metadata tag let an operator filter rollback artifacts.
  - Provenance tags identical to AWS ACM
    (certctl-managed-by=certctl + certctl-certificate-id=<mc-id>),
    automatically applied on every import. Key Vault carries tags
    forward across versions (unlike ACM which strips on re-import),
    so no separate AddTags call is required.
  - DeploymentRequest.KeyPEM held in agent memory only; PFX wrapping
    happens in-memory via software.sslmate.com/src/go-pkcs12. No
    disk write.

Tests:
  - azurekv_test.go: 13-subtest happy-path + validation matrix —
    ValidateConfig (success / missing-vault-url / malformed-vault-
    url / missing-cert-name / invalid-credential-mode / reserved-
    tag rejection), DeployCertificate (fresh import / rollback-on-
    serial-mismatch / empty-key-rejected / no-client-rejected /
    SDK-error-surfaced), ValidateOnly (returns sentinel),
    ValidateDeployment (serial match / mismatch).
  - All tests use the NewWithClient injection seam; no real-Azure
    API calls.
  - go test -short -count=1 ./internal/connector/target/azurekv/...
    green.

Wiring:
  - internal/domain/connector.go: TargetTypeAzureKeyVault =
    "AzureKeyVault".
  - internal/service/target.go: validTargetTypes set extended.
  - cmd/agent/main.go::createTargetConnector: AzureKeyVault case
    arm mirroring the AWSACM shape exactly.
  - cmd/agent/agent_test.go::TestCreateTargetConnector_AllSupported
    Types: AzureKeyVault added to the type matrix + the InvalidJSON
    matrix (16 supported target types now, up from 15).

go.mod / go.sum:
  - github.com/Azure/azure-sdk-for-go/sdk/azcore v1.20.0 (direct).
  - github.com/Azure/azure-sdk-for-go/sdk/azidentity v1.13.1 (direct).
  - github.com/Azure/azure-sdk-for-go/sdk/security/keyvault/
    azcertificates v1.4.0 (direct). The deprecated
    /keyvault/azcertificates path appears as a transitive indirect
    via Microsoft's microsoft-authentication-library-for-go; we use
    the new /security/keyvault/ path exclusively.

Documentation:
  - docs/connectors.md "Azure Key Vault" section: config table, RBAC
    role recipe (off-the-shelf "Key Vault Certificates Officer" or
    custom role with 3 data-plane actions), AKS workload-identity /
    managed-identity / service-principal / default credential
    recipes, atomic-rollback contract + Azure-version semantics
    explanation, soft-delete caveat, App Gateway / Front Door
    Terraform attachment snippet, threat model carve-outs (no disk
    writes, mandatory provenance tags, no long-lived secrets in
    Config), 5-bullet procurement checklist crib.

Out of scope (intentional, flagged in V3-Pro forward path):
  - Azure Front Door direct-attach (UpdateRoutingConfig — different
    Azure RBAC scope).
  - App Gateway / App Service auto-bind (V3-Pro auto-attach).
  - Soft-delete recovery (acm:RecoverDeletedCertificate-equivalent
    requires extra RBAC; V2 keeps minimum-permission surface).
  - GCP Certificate Manager (separate cloud, separate connector).

Verified locally:
- gofmt clean.
- go vet ./internal/connector/target/azurekv/...
  ./internal/domain/... ./internal/service/...
  ./cmd/agent/...  clean.
- go test -short -count=1 ./internal/connector/target/azurekv/...
  ./cmd/agent/...  green (all 16 supported target types
  instantiate via the agent factory).

Reference: cowork/infisical-deep-research-results.md Part 5 Rank 5.
Acquisition prompt:
cowork/rank-5-aws-acm-azure-kv-target-adapters-prompt.md.
Companion commit (AWS half): edf6bee.
2026-05-03 22:43:45 +00:00
shankar0123 edf6bee7f8 target(awsacm): SDK-driven AWS Certificate Manager target connector
Closes Rank 5 (AWS half) of the 2026-05-03 Infisical deep-research
deliverable (cowork/infisical-deep-research-results.md Part 5).
Pre-fix, certctl had no path to deploy certs to AWS-managed TLS-
termination endpoints (ALB / CloudFront / API Gateway / App Runner)
— operators terminating TLS at AWS had to use Infisical secret-sync,
manual aws-cli imports, or external automation. This commit lands
the SDK-driven AWS Certificate Manager target connector that closes
the gap end-to-end.

Architecture:
  - internal/connector/target/awsacm/awsacm.go — Connector wraps
    *acm.Client behind the ACMClient interface seam (mirrors
    awsacmpca's ACMPCAClient pattern from the issuer side).
    LoadDefaultConfig handles the standard AWS credential chain
    (IRSA / EC2 instance profile / SSO / env vars); no embedded
    creds in connector Config.
  - Pre-deploy snapshot via DescribeCertificate + GetCertificate so
    on-import-failure rollback restores the previous cert. Mirrors
    the Bundle 5 IIS pattern + the Bundle 7/8 WinCertStore /
    JavaKeystore patterns. Surfaces rollback success/failure via
    the existing certctl_deploy_rollback_total Prometheus counter
    label set.
  - Provenance tags: certctl-managed-by=certctl + certctl-
    certificate-id=<mc-id> set automatically on every import. ACM
    strips tags on re-import, so the connector calls
    AddTagsToCertificate post-import to keep the provenance pair
    fresh. Operators looking up a cert ARN by managed-cert ID
    (Terraform data source, CloudFormation output) match against
    these tags.
  - DeploymentRequest.KeyPEM held in agent memory only — never
    written to disk. Aligns with the pull-only deployment model
    documented in CLAUDE.md.

Tests:
  - awsacm_test.go: 15-subtest happy-path + validation matrix
    covering ValidateConfig (success / missing-region / malformed-
    region / malformed-ARN / reserved-tag rejection),
    DeployCertificate (fresh import / rotate-in-place / rollback-
    on-serial-mismatch / rollback-also-fails / empty-key-rejected /
    no-client-rejected), ValidateOnly (returns sentinel),
    ValidateDeployment (serial match / mismatch / no-ARN-yet).
  - awsacm_failure_test.go: 5 per-error-class contract tests
    mirroring the awsacmpca_failure_test.go shape (commit
    a2a59a8) — AccessDeniedException (smithy.GenericAPIError),
    ResourceNotFoundException (typed), ThrottlingException
    (smithy.GenericAPIError, FaultServer preserved),
    InvalidArgsException (typed, terminal), RequestInProgress
    Exception (typed). All assert errors.As against the SDK type +
    operator-actionable substring + connector-side wrap framing.
  - Coverage on awsacm.go: 54.9% of statements (matches the K8s-
    Secret + IIS connectors' 50-65% range; rollback-failure paths
    contribute most of the un-covered surface — those exercise
    only when the rollback's SDK call also returns an error).
  - go test -race -count=10 green; no goroutine leaks.

Wiring:
  - internal/domain/connector.go: TargetTypeAWSACM = "AWSACM".
  - internal/service/target.go: validTargetTypes set extended.
  - cmd/agent/main.go::createTargetConnector: AWSACM case arm
    mirroring the KubernetesSecrets shape exactly. Calls
    awsacm.New(context.Background(), &cfg, a.logger) — the
    SDK-loading happens here, not lazily, so config errors
    surface at agent boot.
  - cmd/agent/agent_test.go::TestCreateTargetConnector_AllSupported
    Types: AWSACM added to the type matrix + the InvalidJSON
    matrix.

go.mod / go.sum:
  - github.com/aws/aws-sdk-go-v2/service/acm v1.38.3 (direct).
    aws-sdk-go-v2 + service/acmpca + smithy-go were already direct
    from the awsacmpca issuer; this is the distribution-side
    companion package.

Documentation:
  - docs/connectors.md "AWS Certificate Manager (ACM)" section:
    config table, IAM policy JSON (5 actions on
    arn:aws:acm:*:*:certificate/*), IRSA / EC2 instance-profile /
    SSO auth recipes, atomic-rollback contract, Terraform ALB-
    attachment snippet, threat model carve-outs (no disk writes,
    mandatory provenance tags, no long-lived creds in Config),
    procurement checklist crib (5 bullets paste-able into a
    security review).

Out of scope (intentional, flagged in V3-Pro forward path):
  - CloudFront / ALB auto-attach (UpdateDistribution requires a
    different IAM scope than ACM ImportCertificate).
  - Cross-region ACM replication (ACM is regional; CloudFront
    forces us-east-1).
  - Tag-filtered ARN discovery (V2 uses operator-pinned
    Config.CertificateArn after first deploy; tag-scan path
    requires acm:ListTagsForCertificate which we deliberately
    keep off the minimum-IAM-policy surface).
  - Azure Key Vault (separate cloud, separate connector — Azure
    half of Rank 5 ships in a follow-on commit).

Verified locally:
- gofmt clean.
- go vet ./internal/connector/target/awsacm/...
  ./internal/domain/... ./internal/service/...
  ./cmd/agent/...  clean.
- go test -short -count=1 ./internal/connector/target/awsacm/...
  ./internal/domain/... ./cmd/agent/...  green (15 + 5 awsacm
  subtests; all 15 supported target types instantiate via the
  agent factory).
- go test -race -count=10 ./internal/connector/target/awsacm/...
  green.

Reference: cowork/infisical-deep-research-results.md Part 5 Rank 5.
Acquisition prompt:
cowork/rank-5-aws-acm-azure-kv-target-adapters-prompt.md.
2026-05-03 22:32:45 +00:00
shankar0123 109f32ff41 notifications: per-policy multi-channel expiry-alert routing
Closes Rank 4 of the 2026-05-03 Infisical deep-research deliverable
(see cowork/infisical-deep-research-results.md Part 5). Pre-fix,
RenewalService.CheckExpiringCertificates already ran daily,
RenewalPolicy.AlertThresholdsDays drove per-cert thresholds, and
NotificationService.SendThresholdAlert deduped per (cert, threshold)
— but the channel was hardcoded to Email
(internal/service/notification.go:118 pre-fix). Operators who
configured PagerDuty / Slack / Teams / OpsGenie via
CERTCTL_PAGERDUTY_ROUTING_KEY etc. got nothing at any threshold
unless SMTP was also wired. Their first signal of an expired cert
was a 3 AM outage.

This commit lands the routing matrix on top of the existing
infrastructure:

  1. RenewalPolicy gains AlertChannels (per-tier channel list) +
     AlertSeverityMap (per-threshold tier assignment) +
     EffectiveAlertChannels / EffectiveAlertSeverity accessors.
     Default*() helpers preserve the back-compat Email-only
     behaviour for operators who haven't touched their policies
     post-upgrade. Migration 000026 adds the JSONB columns
     idempotently.
  2. NotificationService.SendThresholdAlertOnChannel — the new
     per-channel dispatch helper. Old SendThresholdAlert stays as
     an Email-only alias so non-policy callers (admin "send test
     alert" surfaces) keep working byte-for-byte.
  3. NotificationService.HasThresholdNotificationOnChannel — per-
     (cert, threshold, channel) deduplication so a transient
     PagerDuty 5xx today does NOT suppress today's Slack alert and
     tomorrow's PagerDuty retry will still fire.
  4. RenewalService.sendThresholdAlerts walks the resolved channel
     set per threshold tier, fans out to every configured channel,
     handles per-channel failures independently, defensively drops
     off-enum channels with an audit row trail, and records a per-
     channel audit event with metadata.channel + metadata.severity_tier.
  5. service.ExpiryAlertMetrics — atomic counter table mirrored on
     the VaultRenewalMetrics shape from the 2026-05-03 audit fix #5
     (commit 0792271). Three labels: channel × threshold × result
     (success / failure / deduped). Cardinality bound: 6 × 4 × 3 =
     72 series for the standard 4-threshold matrix.
  6. handler.MetricsHandler.SetExpiryAlerts wires the Prometheus
     exposer for certctl_expiry_alerts_total{channel,threshold,result}.
     Pre-sorted snapshot for byte-stable emission.
  7. cmd/server/main.go threads ONE service.ExpiryAlertMetrics
     instance through both the recording side (notificationService.
     SetExpiryAlertMetrics) and the exposing side
     (metricsHandler.SetExpiryAlerts).

Dispatch flow (post-fix, per renewal-loop tick):

  cert ages past T-30  → daily renewal-loop fires
                       → policy lookup
                       → for each crossed threshold:
                           - resolve severity tier (informational/
                             warning/critical) via AlertSeverityMap
                           - look up channel set in AlertChannels[tier]
                           - for each channel: dedup → SendThresholdAlertOnChannel
                             → notifierRegistry[channel] → audit row →
                             Prometheus counter increment

Tests (internal/service/renewal_expiry_alerts_test.go):

  TestExpiryAlerts_DefaultMatrix_EmailOnly
  TestExpiryAlerts_PerTierFanOut
  TestExpiryAlerts_PerChannelDedup
  TestExpiryAlerts_OneChannelFails_OthersStillFire
  TestExpiryAlerts_OffEnumChannelDropped
  TestExpiryAlerts_MetricCounterIncrements
  TestExpiryAlerts_NilPolicy_FallsToDefault
  TestExpiryAlerts_OperatorOptOutOfTier

The PerTierFanOut test wires 6 mock notifiers, drives a cert at 0
days through the canonical 4 thresholds with the matrix
{informational:[Slack], warning:[Slack,Email],
critical:[PagerDuty,OpsGenie,Email]}, and asserts the exact
recipient counts: Slack=3, Email=3, PagerDuty=1, OpsGenie=1, no
Teams, no Webhook. The OneChannelFails test pins that PagerDuty
returning a 503 does NOT skip Slack/Email at the same threshold.

Drive-by fix (internal/service/testutil_test.go): the existing
mockNotifRepo.List ignored its filter and returned all rows, which
let legacy tests pass on dedup-via-substring even though the
postgres repo actually applied the filter. Updated the mock to
honour CertificateID / Type / Status / Channel / MessageLike
filters in the same shape as the postgres implementation
(internal/repository/postgres/notification.go). All pre-existing
service tests still pass — the legacy test suite happened to be
robust to the mock filter doing nothing.

Documentation:
  - docs/connectors.md Notifier section gains "Routing expiry
    alerts across channels" — operator-facing, JSON example,
    procurement playbook ("How do I make sure PagerDuty pages on
    the T-1 alert?"), debug recipe via SQL on audit_events +
    notification_events + Prometheus.
  - docs/runbook-expiry-alerts.md — sysadmin-grade flowchart,
    per-policy channel-matrix configuration recipes, "did the on-
    call team get paged?" SQL queries, cardinality budget, V3-Pro
    forward path.
  - cowork/WORKSPACE-ROADMAP.md gains "Multi-channel expiry
    alerts: per-owner routing" V3-Pro entry under Adapter
    hardening.

Out of scope (intentional, flagged in V3-Pro forward path):
  - Per-owner / per-team / per-tenant channel routing (matrix is
    per-policy today, not per-owner).
  - Calendar-aware suppression (no T-30 alerts on weekends).
  - Escalation chains (T-1 unanswered for 30m → escalate).
  - Per-channel rate limiting (downstream of I-005 retry+DLQ).

CHANGELOG.md is intentionally not hand-edited per CHANGELOG.md
itself ("no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/domain/... ./internal/service/...
  ./internal/api/handler/... ./cmd/server/...  clean.
  (./internal/repository/postgres/... vet failed on transitive
  testcontainers/docker module download — sandbox disk pressure,
  not a code issue; postgres-repo build succeeds and tests pass.)
- go test -short -count=1 ./internal/domain/...
  ./internal/service/... ./internal/api/handler/...  green.
- go test -race -count=10 -run 'TestExpiryAlerts'
  ./internal/service/...  green (per-channel dedup race-free).

Reference: cowork/infisical-deep-research-results.md Part 5 Rank 4.
Acquisition prompt: cowork/rank-4-multichannel-expiry-alerts-prompt.md.
2026-05-03 22:12:32 +00:00
shankar0123 0792271dc6 vault: add automatic token renewal at TTL/2 + Prometheus metric
Closes Top-10 fix #5 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix, the
VaultPKI adapter authenticated with a static token and never called
renew-self. Long-lived deploys hit token expiry; the first
operator-visible signal was failed cert renewals on production
targets.

This commit:

  1. Connector.Start(ctx) spawns a goroutine that calls
     POST /v1/auth/token/renew-self at TTL/2 cadence (computed from a
     one-shot lookup-self at startup). Honours ctx.Done() for
     graceful shutdown via a per-loop done channel + Stop().
  2. On `renewable: false` response (initial lookup OR any subsequent
     renewal), the loop emits a WARN, increments the not_renewable
     counter, and exits. The operator must rotate the token before
     Vault's Max TTL elapses.
  3. New Prometheus counter certctl_vault_token_renewals_total with
     labels result={success,failure,not_renewable}. Registered
     alongside existing certctl_issuance_* counters in
     internal/api/handler/metrics.go.
  4. ERROR-level logging on renewal failure with operator-actionable
     substring ("vault token renewal failed; rotate the token before
     TTL expires") so journalctl + grep find it. Loop keeps ticking
     after a failure — transient blips don't kill it.

New optional issuer.Lifecycle interface:

  type Lifecycle interface {
      Start(ctx context.Context) error
      Stop()
  }

Connectors that hold no background goroutines (almost all of them)
do not implement this — IssuerRegistry.StartLifecycles /
StopLifecycles feature-detect via type assertion. New
lifecycle-bearing connectors plug in by implementing the interface;
no further registry plumbing required.

Wiring (cmd/server/main.go):

  - service.NewVaultRenewalMetrics() instance is shared between
    issuerRegistry.SetVaultRenewalMetrics (so Vault connectors built
    by Rebuild get a recorder) and metricsHandler.SetVaultRenewals
    (so the Prometheus exposer emits the new series).
  - issuerRegistry.StartLifecycles(ctx) is called after
    issuerService.BuildRegistry; defer issuerRegistry.StopLifecycles
    is paired so goroutines exit cleanly on signal.
  - IssuerConnectorAdapter.Underlying() exposes the wrapped
    issuer.Connector so registry-level machinery can reach the
    concrete connector behind the adapter without duplicating the
    wiring at every call site.

Tests (internal/connector/issuer/vault/vault_renew_test.go):

  - TestVault_RenewLoop_TickAtHalfTTL — three ticks → three
    renewals, all "success".
  - TestVault_RenewLoop_StopsOnNotRenewable — second renewal returns
    renewable=false, loop exits, third tick fires no HTTP call.
  - TestVault_RenewLoop_FailureSurfacesViaMetric — first renewal 403
    bumps "failure", second renewal succeeds → loop kept ticking.
  - TestVault_RenewLoop_CtxCancellation_StopsCleanly — Stop returns
    within 200ms after ctx cancel.
  - TestVault_RenewLoop_StartsNothingWhenNotRenewable — token
    already non-renewable at boot ⇒ no goroutine, "not_renewable"
    metric increments at startup so operators see it in Grafana.
  - TestVault_ComputeInterval — 4 cases pinning TTL/2 +
    minRenewInterval floor.
  - TestVault_RenewSelf_ParseFailure_NamesActionableInError —
    surfaced error contains "vault token renewal failed" + "rotate
    the token".

Cadence is dynamic — every successful renewal re-derives TTL/2
from the renewed lease's lease_duration, so a short bootstrap
token that gets renewed up to a longer Max TTL shifts to the
longer cadence automatically (defends against degenerate fast
ticking on a token whose Max TTL is far longer than its initial
TTL).

Documentation:
  - docs/connectors.md Vault PKI section gains "Token TTL +
    automatic renewal" subsection (operator-facing: cadence, metric,
    renewable=false rotation playbook).

Out of scope (intentional, flagged in the audit follow-up):
  - AppRole / Kubernetes / AWS IAM auth methods (different renewal
    semantics).
  - Hot-reload of rotated token from disk (operator restarts
    today; future: GUI/MCP issuer-update path triggers Rebuild
    which Stops the old connector and Starts the new one).
  - Auto-re-auth after token death (operator playbook owns it).

CHANGELOG.md is intentionally not hand-edited (per CHANGELOG.md
itself: "no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/service/... ./internal/api/handler/...
  ./internal/connector/issuer/vault/... ./cmd/server/...  clean.
- go test -short -count=1 ./internal/connector/issuer/vault/...
  ./internal/service/... ./internal/api/handler/...  green.
- go test -race -count=10 -run 'TestVault_RenewLoop|TestVault_ComputeInterval'
  ./internal/connector/issuer/vault/...  green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/RESULTS.md
Top-10 fix #5.
2026-05-03 21:24:27 +00:00
shankar0123 bee47f0318 acme-server: cert-manager integration test + production hardening (Phase 5/7)
Closes the production-readiness loop on the ACME surface. After this
commit, certctl ships per-account rate limits + a GC sweeper for
expired ACME state + a kind-driven cert-manager 1.15 integration test
+ a lego-driven RFC conformance harness + a k6 loadtest scenario for
the unauthenticated ACME path.

Architecture:
  - Rate limits live in-memory + per-replica. Restart wipes the
    counters; orders/hour caps are eventual-consistency anyway. A
    3-replica certctl-server fleet behind an LB effectively has 3x
    the configured throughput per account; persistent rate limiting
    is a follow-up if production telemetry shows abuse patterns we
    can't catch in a single restart cycle. Per-key + per-action
    isolation: ActionNewOrder/acc-1, ActionKeyChange/acc-1, and
    ActionChallengeRespond/<challenge-id> are independent buckets.
  - GC loop follows the existing scheduler-loop pattern (atomic.Bool
    + sync.WaitGroup; see crlGenerationLoop for shape). Three
    independent SQL sweeps per tick (DELETE expired nonces; UPDATE
    pending authzs whose expires_at < now() to expired; UPDATE
    pending/ready/processing orders whose expires_at < now() to
    invalid). Each sweep is a single statement; failures are logged-
    and-continued so a failing nonces sweep doesn't block authzs.
    Per-sweep 1m timeout bounds a stuck Postgres.
  - cert-manager integration test is gated on KIND_AVAILABLE so CI
    skips it cleanly (kind is too heavy for per-PR). Operators run
    locally via 'make acme-cert-manager-test'; the harness brings up
    a fresh cluster each run + tears it down on Cleanup.
  - lego conformance harness drives a real ACME client through
    register → run → cert-PEM-landed against a hermetic certctl
    stack. Catches RFC-shape regressions third-party clients would
    hit before they ship.
  - k6 ACME-flow scenario hammers the unauthenticated surface
    (directory + new-nonce + ARI synthetic-id) at 100 VUs × 5m. JWS-
    signed flows are out of scope for k6 (no JWS support); they're
    covered by the lego harness above.

What ships:
  - internal/api/acme/ratelimit.go (+ ratelimit_test.go: 7 cases —
    disable-when-perHour-zero, capacity, per-key isolation, per-
    action isolation, refill-over-time, RetryAfter, concurrent-access
    with -race + 200 goroutines × 200 calls).
  - internal/repository/postgres/acme.go: 4 new methods —
    CountActiveOrdersByAccount + GCExpiredNonces + GCExpireAuthorizations
    + GCInvalidateExpiredOrders. Each a single SQL statement.
  - internal/service/acme.go: SetRateLimiter + GarbageCollect +
    rate-limit gates at 3 entry points (CreateOrder + RotateAccountKey
    + RespondToChallenge) + concurrent-orders gate at CreateOrder.
    2 new sentinels (ErrACMERateLimited, ErrACMEConcurrentOrdersExceeded);
    5 new GC metrics (gc_runs / gc_run_failures / gc_nonces_reaped /
    gc_authzs_expired / gc_orders_invalidated).
  - internal/scheduler/scheduler.go: ACMEGarbageCollector interface +
    acmeGCRunning atomic.Bool + acmeGCInterval + 2 setters (SetACME-
    GarbageCollector + SetACMEGCInterval) + acmeGCLoop following the
    crlGenerationLoop shape.
  - internal/api/handler/acme.go: writeServiceError gains rateLimited
    (429 + RFC 8555 §6.7) + concurrent-orders-exceeded mappings.
  - internal/config/config.go: 5 new env vars
    (CERTCTL_ACME_SERVER_RATE_LIMIT_ORDERS_PER_HOUR=100,
    CERTCTL_ACME_SERVER_RATE_LIMIT_CONCURRENT_ORDERS=5,
    CERTCTL_ACME_SERVER_RATE_LIMIT_KEY_CHANGE_PER_HOUR=5,
    CERTCTL_ACME_SERVER_RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR=60,
    CERTCTL_ACME_SERVER_GC_INTERVAL=1m).
  - cmd/server/main.go: NewRateLimiter() + SetRateLimiter() at
    startup; conditional SetACMEGarbageCollector(acmeService) +
    SetACMEGCInterval(cfg.ACMEServer.GCInterval) when Enabled+
    GCInterval > 0.
  - deploy/test/acme-integration/: kind-config.yaml + cert-manager-
    install.sh + clusterissuer-trust-authenticated.yaml +
    clusterissuer-challenge.yaml + certificate-test.yaml + conformance-
    lego.sh + certmanager_test.go (//go:build integration + KIND_AVAILABLE
    gate).
  - deploy/test/loadtest/k6/acme_flow.js + README ACME-flows section.
  - Makefile: 2 new PHONY targets (acme-cert-manager-test +
    acme-rfc-conformance-test).
  - docs/acme-server.md: status flipped to Phase 5; Configuration
    table grows 5 rows; new 'Phase 5 — operational guidance' section
    explaining rate-limit math + GC sweeper semantics + cert-manager
    integration + lego conformance + k6 baseline.

Tests:
  - 'go vet ./...' clean across the repo.
  - 'go test -short -count=1 ./internal/...' green across every
    affected package (service / acme / handler / scheduler / repo /
    config).
  - 'go vet -tags=integration ./deploy/test/acme-integration/' clean
    (the integration test compiles cleanly with the build tag).
  - The kind/cert-manager harness is gated behind KIND_AVAILABLE so
    CI skips by default; operators run locally via 'make acme-cert-
    manager-test'.

Engineering history: cowork/WORKSPACE-CHANGELOG.md 'ACME-Server-5'.
2026-05-03 19:42:03 +00:00
shankar0123 1e1bc9b3b4 ci: fix Phase 4 post-push unused-symbol failures
CI on commit f6ba563 (Phase 4 gofmt fix) failed golangci-lint's
'unused' linter on internal/service/acme_phase4_test.go: the
stubRenewalPolicies type + its Get method were defined for a future
RenewalInfo happy-path test that I never actually wrote — only the
disabled + bad-cert-id negatives. The dead-code carried forward
because go vet doesn't catch unused-but-exported-shape, and the
package-private use never materialized.

Fix: delete the stubRenewalPolicies type + its method + the
adjacent stub-comment that referenced a similarly-imagined
stubIssuerConn that was never written either. The tests I have
(RotateAccountKey happy + duplicate, RevokeCert kid + jwk paths +
already-revoked + reason-clamping, RenewalInfo disabled +
bad-cert-id) all still pass — they don't reference the removed
type. The window-math is exercised directly in
internal/api/acme/phase4_test.go::TestComputeRenewalWindow_*; the
service-layer policy-lookup wiring is read at handler smoke time
in Phase 5.

Confirmed: 'gofmt -l .' clean; 'go vet ./internal/service/' clean;
'go test -short -count=1 ./internal/service/' green. Pre-commit
verification gate updated implicitly: future Phase commits should
spot-check unused-shape via grep against the test file (every
stub* helper should have ≥3 references, matching the live
helpers' usage profile).
2026-05-03 19:02:44 +00:00
shankar0123 4dc8d3fa5b acme-server: key rollover + revocation + ARI (Phase 4/7)
Closes the RFC 8555 + RFC 9773 surface beyond the issuance happy-path:
  - POST /acme/profile/<id>/key-change   (RFC 8555 §7.3.5)
  - POST /acme/profile/<id>/revoke-cert  (RFC 8555 §7.6)
  - GET  /acme/profile/<id>/renewal-info/<cert-id>  (RFC 9773 ARI)

After this commit, ACME clients can rotate account keys, revoke certs
through the ACME surface (rather than only via the certctl GUI/API),
and fetch ARI for proactive renewal scheduling.

Architecture:
  - Key rollover: outer JWS verified against the registered account key
    (existing kid path); the inner JWS — embedded as the outer's payload
    — verified against the embedded NEW jwk in a new dedicated routine
    (ParseAndVerifyKeyChangeInner) that enforces RFC 8555 §7.3.5
    inner-only invariants: MUST use jwk + MUST NOT use kid, payload
    .account == outer.kid, payload.oldKey thumbprint-equals registered.
    A single WithinTx swaps the stored thumbprint+pem and writes the
    audit row. Concurrent-rollover safety via SELECT…FOR UPDATE on the
    conflicting account row in UpdateAccountJWKWithTx; the loser
    observes the winner's new thumbprint and is told to retry (409).
  - Revocation: two auth paths. kid → AccountOwnsCertificate single-
    indexed COUNT lookup over acme_orders. jwk → constant-time RFC 7638
    thumbprint compare against the cert's pubkey. Both paths route
    through service.RevocationSvc.RevokeCertificateWithActor so the
    existing CRL/OCSP refresh + audit + metrics pipeline applies. RFC
    5280 §5.3.1 numeric reason codes clamp to certctl's
    domain.ValidRevocationReasons; codes 8 (removeFromCRL) + 10
    (aACompromise) clamp to 'unspecified' since they aren't in the set.
  - ARI is GET-only and unauth per RFC 9773 §4. Cert-id wire shape is
    base64url(AKI).base64url(serial); ParseARICertID strict-decodes,
    SerialHex emits the canonical certctl-shape lowercase-no-leading-
    zeros hex used in certificate_versions.serial_number.
    ComputeRenewalWindow has 3 branches: bound RenewalPolicy →
    [notAfter - days, notAfter - days/2]; no policy → last 33% of
    validity; past expiry → [now, now + 1d] (renew immediately).
    Retry-After honors CERTCTL_ACME_SERVER_ARI_POLL_INTERVAL.

What ships:
  - internal/api/acme/{keychange,ari}.go (+ phase4_test.go: 15 tests).
  - internal/api/acme/order.go: RevokeCertRequest wire shape.
  - internal/api/handler/acme.go: KeyChange, RevokeCert, RenewalInfo
    + 11 new writeServiceError mappings.
  - internal/repository/postgres/acme.go: UpdateAccountJWKWithTx (FOR
    UPDATE + expectedOldThumbprint precondition; ErrACMEAccountKey-
    ConcurrentUpdate sentinel) + AccountOwnsCertificate.
  - internal/service/acme.go: RotateAccountKey + RevokeCert +
    RenewalInfo; CertificateRevoker + RenewalPolicyLookup interfaces;
    SetRevocationDelegate + SetRenewalPolicyLookup wiring; 11 new
    sentinels; 6 new metrics.
  - internal/service/acme_phase4_test.go: service-layer tests for
    RotateAccountKey (happy + duplicate-key) + RevokeCert (kid mismatch
    + jwk mismatch + jwk happy + already-revoked + reason-clamping) +
    RenewalInfo (disabled + bad cert-id).
  - internal/api/router/router.go: 6 new register calls (3 per-profile
    + 3 shorthand). Router parity exceptions extended in lockstep
    (in-tree SpecParityExceptions + CI-only openapi-handler-exceptions
    .yaml).
  - cmd/server/main.go: SetRevocationDelegate(revocationSvc) +
    SetRenewalPolicyLookup(renewalPolicyRepo) at startup.
  - internal/config/config.go: CERTCTL_ACME_SERVER_ARI_ENABLED (default
    true) + CERTCTL_ACME_SERVER_ARI_POLL_INTERVAL (default 6h);
    BuildDirectory's ariEnabled flag now flips on under
    cfg.ARIEnabled.
  - docs/acme-server.md: phase status flipped to Phase 4; endpoints
    table grows 6 rows (3 per-profile + 3 shorthand); FAQ section
    appended explaining how to rotate keys, revoke certs, and consume
    ARI.

Tests:
  - 'go vet ./...' clean across the repo.
  - 'go test -short -count=1 ./...' green across every package.
  - phase4_test.go covers: keychange happy-path + 5 negatives +
    MapKeyChangeErrorToProblem coverage; ARI cert-id round-trip + 6
    malformed cases + BuildARICertID from a generated cert; window-
    math 3 branches.
  - service-layer tests confirm: RotateAccountKey atomically swaps the
    thumbprint (verifies persisted state) and rejects duplicate keys;
    RevokeCert routes through the stub RevocationSvc with the right
    actor string + reason on the jwk path, rejects mismatched keys,
    rejects already-revoked certs, clamps reason codes correctly;
    RenewalInfo respects ARIEnabled + cert-id format.

Engineering history: cowork/WORKSPACE-CHANGELOG.md 'ACME-Server-4'.
2026-05-03 16:51:06 +00:00
shankar0123 62513ad12f ci: fix Phase 3 post-push CI failures (contextcheck + ST1021)
CI on commit 9bc8453 (Phase 3 challenges) failed three lint checks under
golangci-lint. Two were contextcheck on internal/service/acme.go
RespondToChallenge, where the validator-pool dispatch deliberately
detached from the request ctx via 'context.Background()' so the async
WithinTx survives the HTTP handler returning. contextcheck rightly
flagged the non-inherited context — the canonical Go 1.21+ answer for
this exact pattern is context.WithoutCancel(ctx), which preserves
inherited values (logger, trace IDs, audit actor) but detaches
cancellation. Swapping that in clears both contextcheck hits.

The third was ST1021 on internal/api/acme/validators.go: a comment
intended for the (*Pool).Snapshot() method had landed above the
PoolSnapshot type by accident. Split the comment — one prose line
for the type, one for the method — so each exported symbol carries
its own properly-anchored doc.

Confirmed local 'go vet' clean and 'go test -short -count=1' green
across internal/service/ and internal/api/acme/ before commit.
2026-05-03 15:56:03 +00:00
shankar0123 9bc845304e acme-server: HTTP-01 + DNS-01 + TLS-ALPN-01 challenge validation (Phase 3/7)
Wires up the actual challenge-validation machinery so profiles in
acme_auth_mode='challenge' resolve end-to-end. After this commit,
cert-manager 1.15+ with `solver: http01: ingress` against a
challenge-mode profile completes a real HTTP-01 flow and gets a cert.
DNS-01 + TLS-ALPN-01 share the same code path with the appropriate
validator selection.

Architecture (the load-bearing parts):
  - 3 separate semaphore-bounded worker pools (one per challenge type),
    so HTTP-01 and DNS-01 can't starve each other under load. Default
    weight 10 per type; tunable via CERTCTL_ACME_SERVER_HTTP01_CONCURRENCY,
    DNS01_CONCURRENCY, TLSALPN01_CONCURRENCY.
  - 30s per-challenge timeout (configurable via PoolConfig.PerChallengeTimeout).
  - HTTP-01 validator runs validation.IsReservedIPForDial (newly
    exported wrapper preserving the existing private impl byte-for-byte
    for the network scanner + ValidateSafeURL paths) on the resolved
    IP — both at the initial dial and every redirect hop. SSRF probes
    into private IP space are refused before the connect.
  - DNS-01 validator uses a dedicated resolver pointed at
    CERTCTL_ACME_SERVER_DNS01_RESOLVER (default 8.8.8.8:53) — does
    NOT use the system resolver to keep behavior deterministic across
    deployments. Wildcard handling: `*.example.com` queries
    _acme-challenge.example.com.
  - TLS-ALPN-01 validator (RFC 8737) connects with ALPN `acme-tls/1`,
    inspects the id-pe-acmeIdentifier extension (OID 1.3.6.1.5.5.7.1.31),
    asserts the ASN.1 OCTET STRING value equals SHA-256 of the key
    authorization. Cert chain is intentionally NOT validated
    (InsecureSkipVerify=true is correct per RFC 8737 — the proof is
    in the extension, not the chain). Documented in docs/tls.md L-001
    table + the //nolint:gosec comment carries the justification.
    SSRF guard: same posture as HTTP-01.
  - Validation is asynchronous: handler accepts the POST and returns
    200 immediately with status=processing; the worker-pool fires a
    callback that updates challenge → authz → order in a fresh
    background-context WithinTx. The order auto-promotes to `ready`
    when ALL authzs become valid; auto-fails to `invalid` when ANY
    authz becomes invalid.

What ships:
  - internal/api/acme/challenge.go: KeyAuthorization (RFC 8555 §8.1) +
    DNS01TXTRecordValue (§8.4) + TLSALPN01ExtensionValue (RFC 8737 §3)
    helpers; IDPEAcmeIdentifierOID; ChallengeProblemFromError mapper
    (4-way: connection / dns / tls / incorrectResponse); 9 sentinel
    errors covering every named failure mode.
  - internal/api/acme/validators.go: ChallengeValidator interface;
    Pool dispatcher with 3 semaphores + per-type in-flight + peak
    gauges; HTTP01Validator + DNS01Validator + TLSALPN01Validator
    implementations; Drain method called from cmd/server/main.go's
    shutdown sequence.
  - internal/api/acme/validators_test.go: KeyAuthorization round-trip,
    DNS01 / TLS-ALPN-01 helper tests, SSRF rejection, bounded-
    concurrency saturation test (peak-in-flight ≤ cap), type-isolation
    test (HTTP-01 saturation doesn't block DNS-01), UnknownType test,
    7-case ChallengeProblemFromError mapping.
  - internal/repository/postgres/acme.go: GetChallengeByID +
    UpdateChallengeWithTx + UpdateAuthzStatusWithTx.
  - internal/service/acme.go: SetValidatorPool wires the *acme.Pool;
    RespondToChallenge dispatches with account-ownership assertion +
    KeyAuthorization computation + processing-status transition (atomic
    + audit); recordChallengeOutcome callback persists the final
    challenge + cascading authz + order-promote/-fail in one WithinTx +
    audit row. 4 new metrics.
  - internal/api/handler/acme.go: Challenge handler; round-trips
    account.JWKPEM through ParseJWKFromPEM to recover the *jose.JSONWebKey
    the validator pool needs.
  - internal/api/router/router.go + openapi_parity_test.go +
    api/openapi-handler-exceptions.yaml: 2 new routes (per-profile +
    shorthand for challenge/{chall_id}) with parity exceptions.
  - cmd/server/main.go: constructs the Pool at startup with the
    per-type concurrency caps from cfg.ACMEServer; ACMEService.ValidatorPool()
    accessor exposed for the shutdown drain sequence.
  - internal/validation/ssrf.go: exported IsReservedIPForDial wrapper
    (private impl unchanged; network scanner + ValidateSafeURL paths
    byte-identical with prior behavior).
  - docs/tls.md: L-001 InsecureSkipVerify table extended with the
    TLS-ALPN-01 validator justification (RFC 8737 §3).
  - docs/acme-server.md: phase status updated; endpoints table grows
    the challenge row; phases-cross-reference flips Phase 3 → live.

Tests:
  - 80%+ coverage on the new files.
  - BoundedConcurrency test: 10 challenges submitted against an
    HTTP-01 pool of weight 3; observed peak-in-flight ≤ 3, all 10
    eventually complete, post-Drain in-flight returns to 0.
  - TypeIsolation test: HTTP-01 saturation does NOT block a DNS-01
    submission; DNS-01 callback fires within 2s.
  - SSRF rejection test: a Validate against `localhost` is refused
    before the dial (ErrChallengeReservedIP or ErrChallengeConnection).

Engineering history: cowork/WORKSPACE-CHANGELOG.md "ACME-Server-3".
2026-05-03 14:09:00 +00:00
shankar0123 c351bba41a acme-server: orders + authorizations + finalize + cert download (Phase 2/7)
Closes the issuance loop in trust_authenticated mode (commits ec88a61
+ 44a85d6 wired the foundation + JWS-verified account resource).
After this commit, an ACME client running against a profile with
acme_auth_mode='trust_authenticated' end-to-end-issues a real cert:

  POST /acme/profile/<id>/new-order      → 201 + order URL (status=ready)
  POST /acme/profile/<id>/order/<oid>    → POST-as-GET fetch
  POST /acme/profile/<id>/order/<oid>/finalize  → 200 + status=valid + cert URL
  POST /acme/profile/<id>/cert/<cid>     → 200 + PEM chain

Profiles with acme_auth_mode='challenge' get the same code path with
authz/challenge rows in `pending` state until Phase 3's validators
wire up. The mode is read from the bound profile's column at request
time, NOT cached at server start — operators flipping the column via
SQL take effect on the next order without restart.

Architecture (the load-bearing part):
  - Finalize routes through service.CertificateService.Create — the
    canonical certctl issuance entry point that wraps the
    managed_certificates row insert + audit row in s.tx.WithinTx.
    RenewalPolicy / CertificateProfile / per-issuer-type Prometheus
    metrics / audit rows all apply uniformly to ACME-issued certs via
    the same code path that already serves EST/SCEP/agent/REST issuance.
  - Identifier validation runs BEFORE order creation. Rejected
    identifiers return RFC 7807 with per-identifier subproblems and
    create no order row.
  - Source stamp on managed_certificates: domain.CertificateSourceACME.
    Operators bulk-revoke ACME-issued certs by filtering on Source=ACME.
  - 3-step atomicity boundary documented in code + this commit msg:
    (A) WithinTx-A marks order processing + audit row.
    (B) IssuerConnector.IssueCertificate + CertificateService.Create
        (each in its own WithinTx — Create wraps cert row + audit
        atomically).
    (C) WithinTx-C creates certificate_versions row + transitions order
        to valid + sets certificate_id + audit row.
    The brief window between B and C can leave a managed_certificates
    row whose order is still in `processing`. Phase 5's GC scheduler
    reconciles. Documented inline.

What ships:
  - internal/api/acme/order.go: OrderResponseJSON + AuthorizationResponseJSON
    + ChallengeResponseJSON + NewOrderRequest + FinalizeRequest wire
    shapes; ValidateIdentifiers (Phase 2 syntactic checks, dns-only);
    CSRMatchesIdentifiers (RFC 8555 §7.4 strict equality, case-folded).
  - internal/domain/acme.go: ACMEOrder + ACMEAuthorization + ACMEChallenge
    + ACMEIdentifier + ACMEProblem domain types + closed status enums
    for each (order: pending|ready|processing|valid|invalid; authz:
    pending|valid|invalid|deactivated|expired|revoked; challenge:
    pending|processing|valid|invalid; challenge type: http-01|dns-01|
    tls-alpn-01).
  - internal/domain/profile.go: new ACMEAuthMode field reading from
    certificate_profiles.acme_auth_mode (added in migration 25).
  - internal/domain/certificate.go: new CertificateSourceACME enum value.
  - internal/repository/postgres/profile.go: extended SELECT/scanProfile
    to read the per-profile acme_auth_mode column with a COALESCE
    default of trust_authenticated.
  - internal/repository/postgres/acme.go: full order/authz/challenge
    CRUD (CreateOrderWithTx + GetOrderByID + UpdateOrderWithTx +
    CreateAuthzWithTx + GetAuthzByID + ListAuthzsByOrder +
    ListChallengesByAuthz + CreateChallengeWithTx) with proper
    sql.NullTime + JSONB handling. scanACMEOrder /
    scanACMEAuthz / scanACMEChallenge helpers.
  - internal/service/acme.go: extended ACMERepo interface; new
    SetIssuancePipeline wires certificateService + certificateRepo +
    issuerRegistry. CreateOrder (auth-mode-dispatched: trust_authenticated
    auto-marks order ready + authz valid + 1 placeholder http-01
    challenge valid; challenge mode keeps everything pending). LookupOrder
    (with account-ownership assertion). LookupAuthz. ListAuthzsByOrder.
    FinalizeOrder (3-step atomicity boundary as above; CSR-vs-order
    SAN strict-equality check before issuance; persists FinalizeOrderResult
    {Order, CertID}). LookupCertificate. randIDSuffix + base32encode
    helpers for the human-readable acme-ord-* / acme-authz-* /
    acme-chall-* prefixes (CLAUDE.md "TEXT primary keys with human-
    readable prefixes" architecture decision). 8 new per-op metrics.
  - internal/service/acme_test.go: extended fakeACMERepo with Phase 2
    interface stubs; new orderTrackingRepo for observable persistence;
    2 new tests asserting trust_authenticated → auto-ready/valid and
    challenge → stays-pending.
  - internal/api/handler/acme.go: NewOrder + Order + OrderFinalize +
    Authz + Cert handler methods. orderURL / authzURL / certURL /
    challengeURLBuilder helpers; marshalOrderForResponse fetches
    per-order authzs to populate the URL list. parseOptionalTime for
    notBefore / notAfter.
  - internal/api/handler/acme_handler_test.go: extended mockACMEService
    with Phase 2 method stubs; 4 new handler tests (NewOrder happy +
    rejected-identifier + OrderFinalize bad-CSR + Cert happy).
  - internal/api/router/router.go: 10 new Register calls (5 per-profile
    + 5 shorthand) for new-order, order/{ord_id}, order/{ord_id}/finalize,
    authz/{authz_id}, cert/{cert_id}.
  - internal/api/router/openapi_parity_test.go + api/openapi-handler-exceptions.yaml:
    10 new exception entries.
  - cmd/server/main.go: SetIssuancePipeline at startup, threading
    certificateService + certificateRepo + issuerRegistry into ACMEService.
  - docs/acme-server.md: phase status updated; endpoints table grows
    5 rows for new-order/order/finalize/authz/cert (per-profile +
    shorthand variants); new section "Finalize routing through
    CertificateService.Create" documenting the 3-step atomicity
    boundary + the actor-string convention `acme:<account-id>`.

Tests: ACME package + service + handler + router + config + domain
all green under -short. New cases:
  - TestCreateOrder_TrustAuthenticated_AutoReady (asserts auto-ready
    transition + valid-status authz/challenge + audit row + metric bump).
  - TestCreateOrder_ChallengeMode_StaysPending (asserts pending-status
    cascading authz/challenge for challenge mode).
  - TestACMEHandler_NewOrder_HappyPath (asserts 201 + Location +
    finalize URL shape).
  - TestACMEHandler_NewOrder_RejectedIdentifier (asserts 400 + RFC 7807
    rejectedIdentifier + per-identifier subproblems for type=ip).
  - TestACMEHandler_OrderFinalize_BadCSR (asserts 400 + badCSR for
    non-base64 CSR field).
  - TestACMEHandler_Cert_HappyPath (asserts 200 + PEM content-type +
    PEM chain in body).

Engineering history: cowork/WORKSPACE-CHANGELOG.md "ACME-Server-2".
2026-05-03 13:46:10 +00:00
shankar0123 44a85d6f85 acme-server: account resource + JWS verifier (Phase 1b/7)
Layers JWS-authenticated POST machinery onto the Phase 1a foundation
(commit ec88a61). After this commit, an ACME client can run

  POST /acme/profile/<id>/new-account

against certctl and successfully register an account. Account update
+ deactivation via POST /acme/profile/<id>/account/<acc-id> work.
Orders + challenges remain Phase 2 / 3.

Background:
  Two prior dispatch attempts at the original Phase 1 ("skeleton +
  directory + new-nonce + new-account" as a single commit) failed on
  go-jose v4 API speculation (jws.GetPayload, sig.Algorithm,
  jose.SHA256, etc. — none of those exist in v4). Splitting Phase 1
  into 1a (foundation, no go-jose) and 1b (this commit, all go-jose
  in one place) concentrated the JWS work where attention pays off.
  The verifier reads the actual go-jose v4 surface — ParseSigned with
  closed alg allow-list, Header struct fields (Algorithm, KeyID,
  JSONWebKey, Nonce, ExtraHeaders[HeaderKey]), JWK.Thumbprint with
  stdlib crypto.SHA256.

What ships:
  - internal/api/acme/jws.go: 487-line verifier + sentinel error
    family. Enforces RFC 8555 §6.2 + §6.4 + §6.5 invariants:
      - alg in {RS256, ES256, EdDSA} (closed allow-list passed to
        jose.ParseSigned — HS256 / none / etc. rejected at parse time)
      - exactly one of `kid` / `jwk` in protected header (per
        endpoint policy — new-account demands jwk, others demand kid)
      - protected `url` matches request URL exactly
      - protected `nonce` consumed against acme_nonces (badNonce on
        miss/replay/expiry per RFC 8555 §6.5.1)
      - kid round-trips against canonical AccountKID(accountID) URL
        (catches cross-profile / cross-host replay)
      - kid path: account exists + status=valid (deactivated /
        revoked accounts cannot authenticate)
      - signature verifies; post-Verify payload bytes equal
        UnsafePayloadWithoutVerification (defense in depth)
    + JWK persistence helpers (JWKToPEM / ParseJWKFromPEM round-
    trip a public-only JWK as a PEM-wrapped JSON envelope; stored
    as TEXT in acme_accounts.jwk_pem for diff-friendliness) +
    JWKThumbprint per RFC 7638.
  - internal/api/acme/jws_test.go: 16 cases covering happy paths
    (RS256 kid, ES256 jwk, EdDSA kid) + every named failure mode
    (alg-not-allowed, bad-sig, missing-nonce, unknown-nonce,
    replay, url-mismatch, mixed kid+jwk, deactivated-account,
    cross-host kid). Uses real keypairs + real go-jose Signer to
    build JWS objects.
  - internal/api/acme/account.go: NewAccountRequest /
    AccountUpdateRequest payload shapes (RFC 8555 §7.3 + §7.3.2 +
    §7.3.6) + AccountResponseJSON wire shape + MarshalAccount
    helper.
  - internal/domain/acme.go: ACMEAccount struct + ACMEAccountStatus
    closed enum (valid / deactivated / revoked).
  - internal/repository/postgres/acme.go: full account CRUD path
    (CreateAccountWithTx with 23505-unique-violation sentinel
    translation, GetAccountByID, GetAccountByThumbprint,
    UpdateAccountContactWithTx, UpdateAccountStatusWithTx) +
    sql.ErrNoRows-wrapped repository.ErrNotFound on lookup misses.
  - internal/service/acme.go: ACMERepo interface extended;
    SetTransactor + SetAuditService wires; NewAccount (idempotent
    re-registration per RFC 8555 §7.3.1 — same JWK returns existing
    row without an update or new audit event); LookupAccount;
    UpdateAccount; DeactivateAccount; VerifyJWS adapter that bridges
    api/acme.VerifierConfig to the service-layer ACMERepo; per-op
    metrics extended (new_account_total + _failures_total +
    _idempotent_total + update_account_total + _failures_total +
    deactivate_account_total).
  - internal/service/acme_test.go: 8 new tests covering
    new-account happy path / idempotent re-registration / only-
    return-existing match + no-match / contact update / deactivate
    / lookup-not-found / requires-transactor.
  - internal/api/handler/acme.go: NewAccount + Account handlers.
    Account dispatches POST-as-GET (RFC 8555 §6.3 — empty body or
    {} payload returns the account row), contact update, and
    deactivation from the same endpoint. Defense-in-depth check
    that the kid path-segment matches the URL path-segment (the
    verifier already round-tripped the kid against canonical URL,
    but the handler re-asserts to catch any future verifier
    refactor).
  - internal/api/handler/acme_handler_test.go: 7 new cases
    covering happy-create, idempotent-200, only-return-existing-
    no-match-400, malformed-JWS-400, kid-URL-mismatch-401,
    deactivate, contact-update, POST-as-GET.
  - internal/api/router/router.go: 4 new Register calls (per-
    profile + shorthand for new-account and account/{acc_id}).
  - internal/api/router/openapi_parity_test.go: SpecParityExceptions
    extended with the 4 new routes (RFC 8555 wire-protocol surface,
    not OpenAPI-shaped — same precedent as Phase 1a).
  - cmd/server/main.go: SetTransactor + SetAuditService on
    acmeService at startup so the WithinTx-based new-account /
    update / deactivate paths run with the same transactor instance
    shared across CertificateService / RevocationSvc / RenewalService.
  - docs/acme-server.md: Phase status updated; endpoints table grows
    new-account + account/<acc_id> rows; new "JWS verification
    (Phase 1b)" section enumerates the 7 invariants the verifier
    enforces; phases-cross-reference table marks 1b live.
  - go.mod / go.sum: github.com/go-jose/go-jose/v4 v4.0.4 added.

Atomicity: every account-state mutation writes its acme_accounts row
+ its audit_events row inside one repository.Transactor.WithinTx
call — the canonical certctl atomicity contract (matches
CertificateService.Create at internal/service/certificate.go:131).
Idempotent re-registration explicitly does NOT write an audit row
(RFC 8555 §7.3.1 returns the existing row unmodified).

Tests: 16 jws_test.go cases + 11 service tests + 11 handler tests
all pass under -short. Bad-signature test uses a real registered
account whose stored JWK is a different keypair from the signer's,
so the JWS parses cleanly but jose.Verify rejects — exercises the
ErrJWSSignatureInvalid path directly.

Engineering history: cowork/WORKSPACE-CHANGELOG.md "ACME-Server-1b".
2026-05-03 13:21:56 +00:00
shankar0123 ec88a61274 acme-server: foundation — directory + new-nonce + per-profile routing (Phase 1a/7)
First slice of the RFC 8555 ACME server endpoint (master plan at
cowork/acme-server-endpoint-prompt.md, per-phase prompts at
cowork/acme-server-prompts/). This commit lands the smallest viable
end-to-end deployable slice: an ACME client running

  curl -sk https://certctl/acme/profile/<id>/directory
  curl -sk -I https://certctl/acme/profile/<id>/new-nonce

successfully fetches the directory document and a Replay-Nonce.
Account creation, JWS verification, orders, challenges, and
revocation are all out of scope for this phase and arrive in Phases
1b–4.

Closes the Rank 1 LHF from the 2026-05-03 Infisical deep-research
(cowork/infisical-deep-research-results.md). Pre-fix, certctl was an
ACME consumer only — no /acme/directory endpoint, no JWS verifier,
no challenge validators. K8s customers running cert-manager could
not point at certctl as an ACME issuer; they had to deploy a certctl
agent on every node.

What ships:
  - internal/api/acme/{directory,nonce,errors}.go (+ tests).
  - internal/api/handler/acme.go + acme_handler_test.go.
  - internal/repository/postgres/acme.go (nonce ops only — Phase 1b
    extends with account CRUD; Phases 2-4 extend with order / authz /
    challenge CRUD).
  - internal/service/acme.go (BuildDirectory + IssueNonce stubs;
    Phase 1b adds VerifyJWS / NewAccount / etc.).
  - migrations/000025_acme_server.{up,down}.sql ships the full 5-table
    ACME schema (acme_accounts / acme_orders / acme_authorizations /
    acme_challenges / acme_nonces) PLUS the per-profile
    certificate_profiles.acme_auth_mode column. Phase 1a actively
    uses only acme_nonces; remaining tables are empty until Phases
    1b-4 plug in.
  - internal/config/config.go: ACMEServerConfig struct + ACMEServer
    field on Config. Env vars use CERTCTL_ACME_SERVER_* prefix to
    avoid colliding with the existing consumer-side ACMEConfig at
    config.go:1746 (CERTCTL_ACME_DIRECTORY_URL / PROFILE /
    CHALLENGE_TYPE etc.). Phase 1a wires Enabled +
    DefaultAuthMode + DefaultProfileID + NonceTTL + DirectoryMeta;
    Order/Authz TTLs + per-challenge-type concurrency caps + DNS01
    resolver are reserved fields parsed in 1a so operators can set
    them ahead of Phases 2/3.
  - cmd/server/main.go: wire ACMEHandler into the HandlerRegistry
    literal alongside the existing certificate / EST / SCEP / etc.
    handlers.
  - internal/api/router/router.go: HandlerRegistry.ACME field + 6
    Register calls (3 per-profile + 3 shorthand).
  - internal/api/router/openapi_parity_test.go: 6 new entries in
    SpecParityExceptions. ACME is a wire-protocol surface (JWS-signed
    JSON over HTTPS per RFC 7515) whose semantics are dictated by
    RFC 8555 + RFC 9773 rather than by an OpenAPI document, same
    precedent as SCEP/EST. The canonical reference is
    docs/acme-server.md.
  - docs/acme-server.md: Phase-1a-shaped reference. Configuration
    table for every CERTCTL_ACME_SERVER_* env var. Per-profile
    auth-mode decision tree skeleton. TLS trust bootstrap section
    flagging cert-manager's ClusterIssuer.spec.acme.caBundle
    requirement (the single biggest first-time-deploy footgun;
    the full cert-manager walkthrough lands in Phase 6 but the
    requirement is documented up front).

Architecture decisions baked in:
  - URL family is /acme/profile/<id>/* (per-profile, canonical) with
    /acme/* shorthand active when CERTCTL_ACME_SERVER_DEFAULT_PROFILE_ID
    is set. Path matches existing per-profile precedent in EST + SCEP.
  - Auth mode is per-profile (acme_auth_mode column on
    certificate_profiles), NOT server-wide. One certctl-server can
    serve trust_authenticated for an internal-PKI profile and
    challenge for a public-trust-style profile simultaneously. The
    column is read at request time, not cached at server start —
    operators flipping a profile's mode via SQL take effect on the
    next order without restart.
  - Nonces are DB-backed (acme_nonces table). Survive server restart.
    The RFC 8555 §6.5 replay defense requires the store to outlast
    the client's nonce caching window; an in-memory-only nonce
    store would lose every in-flight order on restart.
  - Per-op atomic counters on service.ACMEService.Metrics() —
    certctl_acme_directory_total, certctl_acme_directory_failures_total,
    certctl_acme_new_nonce_total, certctl_acme_new_nonce_failures_total.
    Naming follows certctl frozen decision 0.10 cardinality discipline.
    Phase 1b will extend with new_account counters; Phase 2 with
    order / finalize / cert; Phase 3 with per-challenge-type counters.

Audit fixes #11 + #12 (cowork/acme-server-prompts/audit-additions.md)
applied:
  - #11: CERTCTL_ACME_SERVER_* prefix avoids the consumer-side
    CERTCTL_ACME_* namespace collision.
  - #12: prior-attempt WIP from two failed Phase-1 dispatches was
    discarded at phase start; this commit starts from a clean tree.

Tests:
  - 14 unit tests in internal/api/acme/ (directory, nonce, errors).
  - 7 handler-level tests via httptest.NewServer + mockACMEService
    (mirrors the mockSCEPService pattern at scep_handler_test.go).
  - 7 service-layer tests with mocked repo + injected profileLookup.
  - All pass under -race -count=1 -short.

Deferred to Phase 1b:
  - JWS verification (go-jose v4 — see master-prompt §8a for the API
    surface and audit doc for the speculation pitfalls).
  - new-account / account/<id> endpoints + AccountService.
  - Nonce *consumption* path (issue path is in this commit; consume
    is only invoked by JWS-verified POSTs which Phase 1b adds).

Engineering history: cowork/WORKSPACE-CHANGELOG.md "ACME-Server-1a".
Per-phase implementation plan: cowork/acme-server-prompts/.
Master plan + audit fixes: cowork/acme-server-endpoint-prompt.md +
cowork/acme-server-prompt-audit.md +
cowork/acme-server-prompts/audit-additions.md.
2026-05-03 12:55:40 +00:00
shankar0123 475421457f fix(test): TestBoundedFanOut_SkipsAgentRoutedDeployments race on seenIDs slice
CI race detector flagged TestBoundedFanOut_SkipsAgentRoutedDeployments
on commit 35e18bf (audit fix #9). The test's `work` closure was
appending to a plain []string slice from worker goroutines without
synchronisation:

    var seenIDs []string
    work := func(ctx context.Context, job *domain.Job) error {
        seen.Add(1)
        seenIDs = append(seenIDs, job.ID)  // race
        return nil
    }

atomic.Int64 covered the count assertion but the slice header itself
is the racing memory — race detector caught both the read+write race
on the slice header and the runtime.growslice path on append.

Fix: protect seenIDs with a sync.Mutex. The slice is only used in
the failure-message branch (`t.Errorf` ids=%v formatting), so the
contention is irrelevant to performance — correctness only.

Also locked around the read in the t.Errorf format-args evaluation,
since that read happens AFTER boundedFanOut returns (and Wait()
inside boundedFanOut synchronizes the worker goroutines), but the
explicit Lock/Unlock makes the synchronisation visible without
depending on the implicit happens-before from Wait.

The other five tests in the file (TestBoundedFanOut_CapHolds,
_AllJobsRun, _CtxCancelInterrupts, _FailedJobsCounted,
TestSetRenewalConcurrency_NormalizesNonPositive) only mutate
atomic.Int64 counters from worker goroutines, so they were
already race-clean.

Verified locally: go test -race -count=1 -run
'TestBoundedFanOut|TestSetRenewalConcurrency' ./internal/service/...
green.
2026-05-02 14:34:48 +00:00
shankar0123 35e18bfc56 scheduler: bound renewal concurrency via CERTCTL_RENEWAL_CONCURRENCY
Closes the #9 acquisition-readiness blocker from the 2026-05-01 issuer
coverage audit. Pre-fix, JobService.ProcessPendingJobs ran every
claimed job sequentially in a single goroutine: safe but slow, and
operators with large fleets had no lever to dial throughput up.
Switching to fire-and-forget per-job goroutines would have unbounded
the upstream-CA call rate and tripped DigiCert / Entrust / Sectigo
rate limits — certctl's response to 429 was to retry on the next
tick, re-fanning out the same calls and digging deeper into the
limit. Operators need a knob.

This commit:

- Adds CERTCTL_RENEWAL_CONCURRENCY env var (default 25) loaded via
  the existing getEnvInt pattern in internal/config/config.go.
  Documented inline as the cap for the per-tick renewal/issuance/
  deployment goroutine fan-out, with operator-tuning guidance:
  permissive upstream limits + large fleets (>10k certs) → 100;
  strict limits or async-CA-heavy fleets → 25 or lower.

- Wires golang.org/x/sync/semaphore.Weighted around the per-job
  goroutine launch in JobService.ProcessPendingJobs. Acquire(ctx, 1)
  is the load-bearing piece — it BLOCKS the loop when at the cap,
  providing real backpressure rather than fire-and-forget. The
  fan-out is split into processPendingJobsSequential (legacy,
  preserved for unit-test wiring that doesn't call
  SetRenewalConcurrency) and processPendingJobsConcurrent (production,
  delegates to a generic boundedFanOut helper).

- boundedFanOut takes the per-job work as a closure so the cap can
  be tested directly without standing up the renewal/deployment
  service graph. processed/failed counters use atomic.Int64 to
  avoid mutex overhead on every job completion; final log line
  reads both AFTER wg.Wait so the counts reflect every dispatched
  job. ctx-aware Acquire ensures a shutdown ctx cancel interrupts
  the dispatch loop promptly; in-flight goroutines drain via Wait
  before the function returns so no goroutine outlives the
  scheduler tick.

- shouldSkipJob extracted as a package-private helper so the
  agent-routed-deployment skip logic is shared between the
  sequential and concurrent paths byte-for-byte (the audit prompt's
  "channel-based semaphore without ctx-aware acquire" anti-pattern
  is explicitly avoided — semaphore.Weighted.Acquire returns on ctx
  done; channel <- struct{}{} would block forever).

- SetRenewalConcurrency setter on JobService normalises ≤0 to 1.
  semaphore.NewWeighted(0) constructs a semaphore that blocks every
  Acquire forever; the normalisation prevents a misconfigured env
  var from wedging the scheduler.

- cmd/server/main.go wires SetRenewalConcurrency(cfg.Scheduler.
  RenewalConcurrency) on the freshly-built jobService, immediately
  after SetAuditService. Production deployments always take the
  bounded path; tests that build JobService directly via
  NewJobService keep their strict-sequential behaviour because
  renewalConcurrency is the zero value.

- Tests in internal/service/job_concurrency_test.go:
  * TestBoundedFanOut_CapHolds — primary regression guard. 50 jobs
    × 50ms work × cap=5 → asserts peak in-flight never exceeds 5
    AND reaches 5 at least once (catches both upper-bound regressions
    and gates that incorrectly cap below the configured value).
    Lock-free max via CompareAndSwap so the measurement instrument
    doesn't itself constrain concurrency.
  * TestBoundedFanOut_AllJobsRun — lower-bound: every non-skipped
    job is dispatched.
  * TestBoundedFanOut_SkipsAgentRoutedDeployments — pins the
    shouldSkipJob contract.
  * TestBoundedFanOut_CtxCancelInterrupts — ctx cancellation
    interrupts a stuck fan-out within the timeout budget.
  * TestBoundedFanOut_FailedJobsCounted — per-job errors don't
    abort the fan-out.
  * TestSetRenewalConcurrency_NormalizesNonPositive — ≤0 → 1 fail-safe
    pinned across negative/zero/positive inputs.

- docs/features.md: scheduler-loop table augmented with the
  concurrency-cap env-var pointer alongside the job-processor row.

- docs/architecture.md: Concurrency Safety section gains a paragraph
  explaining the cap, the operator-tuning guidance, the ctx-aware
  Acquire semantics, and the audit reference.

Operator-facing impact: the first big renewal sweep no longer
takes down the upstream CA's rate-limit budget. Existing deployments
get the bounded path automatically (default 25); operators can
override via env var without code changes.

Verified locally:
- gofmt -l . clean
- go vet ./... clean
- staticcheck ./... clean
- go test -short -count=1 across service / scheduler / config /
  integration: green
- Six new tests under TestBoundedFanOut* + TestSetRenewalConcurrency*:
  green

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #9.
2026-05-02 14:12:30 +00:00
shankar0123 fefa5a5fd7 acme: support serial-only revocation via local cert-version lookup
Closes the #7 acquisition-readiness blocker from the 2026-05-01 issuer
coverage audit. Pre-fix, ACME RevokeCertificate at acme.go:L519-L529
returned the literal error "ACME revocation by serial not supported in
V1; provide certificate DER". RFC 8555 §7.6 genuinely requires the
cert DER bytes (not just the serial), but a CLM platform's job is to
abstract over that limitation. Operators routinely have only the
serial in hand: lost PEM, rotated key, GUI revoke action driven by a
row in the certs list.

This commit:

- Adds CertificateLookupRepo interface at the ACME connector boundary
  (connector boundary, NOT a service/repository import — the connector
  accepts whatever satisfies the shape). Production wiring in
  cmd/server/main.go injects the postgres CertificateRepository; tests
  inject a fake.

- Adds CertificateRepository.GetVersionBySerial(ctx, issuerID, serial)
  + interface declaration in repository/interfaces.go, returning the
  certificate_versions row whose SerialNumber matches, scoped to the
  issuer via JOIN on managed_certificates. Mirrors the existing
  GetByIssuerAndSerial shape but returns the version (where PEMChain
  lives). Per RFC 5280 §5.2.3 the issuer scope is required for
  determinism.

- Adds SetCertificateLookup + SetIssuerID setters on *acme.Connector.
  Mirror the pattern local.Connector already uses for OCSP responder
  wiring. Both must be wired before serial-only revoke works;
  unwired state falls back to a more actionable error pointing at the
  wiring requirement (the historical "not supported" wording is
  retired).

- Rewrites RevokeCertificate end-to-end: lookup → empty-PEM check →
  pem.Decode → block.Type == "CERTIFICATE" check → ensureClient →
  golang.org/x/crypto/acme.Client.RevokeCert(ctx, accountKey, der,
  reasonCode). RFC 8555 §7.6 case 1 (revocation request signed with
  account key) — the same account key issued the cert, so authority
  is intrinsic. The not-found path returns an actionable operator-
  facing error pointing at the local-store requirement.

- Adds mapRevocationReason translating RFC 5280 §5.3.1 reason strings
  (unspecified, keyCompromise, cACompromise, affiliationChanged,
  superseded, cessationOfOperation, certificateHold, removeFromCRL,
  privilegeWithdrawn, aACompromise) into golang.org/x/crypto/acme.
  CRLReasonCode. Accepts canonical camelCase + underscore_lower +
  ALL_CAPS_UNDERSCORE. Nil reason → 0 (unspecified). Unknown reason
  errors rather than silently demoting (operators rely on the reason
  for compliance reporting).

- Wiring update in service/issuer_registry.go: SetACMECertLookup
  setter on the registry; Rebuild type-asserts *acme.Connector and
  calls SetCertificateLookup + SetIssuerID, mirroring the existing
  *local.Connector branch. cmd/server/main.go calls
  issuerRegistry.SetACMECertLookup(certificateRepo) immediately after
  SetIssuanceMetrics — the postgres repo satisfies the interface via
  GetVersionBySerial.

- Tests:
  * acme_revoke_test.go (new): TestRevokeCertificate_NoCertLookupWired,
    TestRevokeCertificate_NoIssuerIDWired,
    TestRevokeCertificate_LookupReturnsNotFound (operator-facing
    "may not have been issued through certctl" hint pinned),
    TestRevokeCertificate_LookupArbitraryError,
    TestRevokeCertificate_VersionPEMEmpty (corrupt-row guard),
    TestRevokeCertificate_PEMMalformed_NoBlock,
    TestRevokeCertificate_PEMMalformed_WrongType (PRIVATE KEY block
    rejected as not a CERTIFICATE).
  * TestMapRevocationReason_TableDriven: full RFC 5280 reason set
    plus camelCase / underscore / ALL-CAPS variants plus
    nil-reason and unknown-reason cases.
  * acme_failure_test.go: renamed TestRevokeCertificate_AlwaysError
    → TestRevokeCertificate_UnwiredCertLookupFallback; the test
    still exercises the same backward-compat branch but now
    asserts the new "CertificateLookup wiring" error wording.

- Mock-repo updates (3 sites): mockCertificateRepository in
  internal/integration/lifecycle_test.go, mockCertRepo in
  internal/service/testutil_test.go, mockCertRepoWithGetError in
  internal/service/shortlived_test.go each gain a GetVersionBySerial
  implementation that mirrors the GetByIssuerAndSerial logic but
  returns the version row.

- docs/connectors.md ACME section: new "Revocation by serial number"
  subsection covering the workflow, the local-store requirement
  (cert was issued through certctl, not imported), the reason-code
  mapping with the three accepted spelling variants, and a pointer
  to the audit reference.

Out of scope (intentional, per spec):

- Recovering the DER from outside the local cert store (CT logs,
  CSR + signature reconstruction). If the cert wasn't issued through
  certctl, revoke-by-serial via certctl isn't possible.
- Revocation via the cert's private key (RFC 8555 §7.6 case 2). The
  account-key path covers all certctl-issued certs because the same
  account key issued them.
- Pebble-backed integration test for the happy path. Pebble integration
  is the right home for that — the unit tests in this commit pin all
  failure-mode branches before the network call, and the wiring
  branch in Rebuild is exercised by the existing
  TestIssuerRegistryRebuild paths.

Verified locally:
- gofmt -l . clean
- go vet ./... clean
- staticcheck ./... clean
- go test -short -count=1 across connector, service, repository,
  integration, api/middleware, api/handler: green

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #7.
2026-05-02 13:09:30 +00:00
shankar0123 74d6b462a4 metrics: gofmt issuance_metrics_test.go — fix CI
Trivial whitespace fix: gofmt collapsed three trailing-comment columns
that I'd hand-aligned in the test file. Local sandbox missed this
because the per-file gofmt run earlier in the commit cycle was scoped
to the changed-files list and didn't include the test file at the
final write moment; CI's project-wide `gofmt -l .` caught it.

Behavior unchanged.
2026-05-02 01:27:33 +00:00
shankar0123 3b92048242 metrics: add per-issuer-type issuance counters, histogram, and failure classifier
Closes the #4 acquisition-readiness blocker from the 2026-05-01 issuer
coverage audit. Before this commit, certctl's Prometheus exposition had
zero per-issuer-type signal — operators answering "is DigiCert slow?"
or "is Sectigo failing more than ACME?" had to grep logs by issuer
name. This commit adds three series labelled by issuer type:

  certctl_issuance_total{issuer_type, outcome}
  certctl_issuance_duration_seconds{issuer_type}            (histogram)
  certctl_issuance_failures_total{issuer_type, error_class}

The histogram covers 0.05–120 second buckets to span the local-issuer
fast path and async-CA slow path (DigiCert/Sectigo/Entrust polling
can take minutes). error_class is a closed enum of eight values
(timeout, auth, rate_limited, validation, upstream_5xx, upstream_4xx,
network, other) classified once in service.ClassifyError. Cardinality
budget is ~276 new series, well within Prometheus's comfortable range.

Implementation:
- service.IssuanceMetrics is the thread-safe counter + histogram
  table. Three independent views (counters / failures / durations)
  exposed via SnapshotCounters / SnapshotFailures / SnapshotDurations.
  sync.RWMutex protects the map shape; per-key sync/atomic.Uint64
  primitives keep the recording hot path lock-free under concurrent
  service-layer goroutines.
- service.IssuanceCounterEntry / IssuanceFailureEntry /
  IssuanceDurationEntry / IssuanceMetricsSnapshotter live in service
  (not handler) to avoid an import cycle: handler already imports
  service for admin_est.go etc., so service can't import handler back.
  Handler's exposer takes the snapshotter via the service-defined
  interface.
- service.ClassifyError pure function maps error → error_class.
  context.DeadlineExceeded / context.Canceled → timeout; *net.OpError
  → network; substring matches against canonical AWS / DigiCert /
  Sectigo error shapes for auth / rate_limited / validation /
  upstream_5xx / upstream_4xx / network; unknown → other. Each branch
  has at least one representative test case in
  TestClassifyError.
- IssuerConnectorAdapter.SetMetrics wires per-adapter recording
  (issuerType + metrics). Existing 28+ test call sites of
  NewIssuerConnectorAdapter keep their one-arg signature; production
  wiring goes through SetMetrics post-construction.
- IssuerRegistry.SetIssuanceMetrics + Rebuild type-asserts to
  *IssuerConnectorAdapter and calls SetMetrics with the issuer type
  string. nil-guarded — tests that hand-build adapters without
  metrics get no-op recording.
- IssuerConnectorAdapter.IssueCertificate / RenewCertificate wrap the
  underlying connector call with start := time.Now() and
  recordIssuance(start, err). Renewal is recorded into the same
  certctl_issuance_* series as initial issuance — operationally,
  renewal IS issuance from the connector's perspective (matches the
  audit prompt's guidance on series naming).
- handler/metrics.go GetPrometheusMetrics gains a new exposer block
  emitting all three series in stable label order with correct
  Prometheus format (_bucket / _sum / _count for the histogram, +Inf
  bucket appended). Sorted via sort.Slice for stable output. nil-
  guarded so deploys without the wire produce clean exposition.
- formatLE helper trims trailing zeros from histogram bucket labels
  via strconv.FormatFloat(le, 'f', -1, 64) so the `le` labels match
  Prometheus client conventions ("0.05", "30", "120", not "0.0500"
  etc.).
- cmd/server/main.go wires a single IssuanceMetrics instance into
  both the IssuerRegistry (recording) and the MetricsHandler (exposer)
  using DefaultIssuanceBucketBoundaries.

Tests:
- TestIssuanceMetrics_RecordAndSnapshot — happy-path counter +
  histogram + failure recording, BucketBoundaries returns a copy
  (not shared storage).
- TestIssuanceMetrics_HistogramCumulative — pins the cumulative-buckets
  contract. 100ms observation lands in 0.1 bucket and every larger
  bucket; 750ms only in the 1.0 bucket. Off-by-one here would
  corrupt every quantile query downstream.
- TestIssuanceMetrics_Concurrency — 100 goroutines × 1000 ops under
  the race detector. Asserts atomic counter integrity across
  contended writes.
- TestClassifyError — 17 cases covering every branch of the closed
  enum plus the nil-error special case.

Implementation chooses the existing hand-rolled fmt.Fprintf
exposition pattern (no prometheus/client_golang dependency added)
to stay consistent with the OCSP / deploy counter blocks already in
the file.

Out of scope (separate follow-ups):
- Revocation metrics (certctl_revocation_*) — symmetric to issuance
  but the audit didn't ask; explicit follow-up commit.
- Discovery / health-check duration histograms.
- prometheus/client_golang migration.

Verified locally:
- gofmt clean
- go vet ./... clean
- staticcheck ./... clean
- golangci-lint run --timeout 5m ./... → 0 issues
- go test -short -count=1 ./internal/service/ green
- go test -short -count=1 -race -run TestIssuanceMetrics ./internal/service/ green
- go test -short -count=1 ./internal/api/handler/ green
- go build ./... success

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #4 (Part 3, narrative section).
2026-05-02 00:39:25 +00:00
shankar0123 b0efdbe2f8 repo,service: introduce WithinTx and atomic audit rows for issue/renew/revoke
Closes the #3 acquisition-readiness blocker from the 2026-05-01 issuer
coverage audit (Part 1.5 finding #1: audit row not transactional with
issuance). AuditRepository.Create previously ran on the package-level
*sql.DB while the certificate insert / version insert / revocation
insert ran on independent connections — a failed audit INSERT after
a successful operation INSERT was silently lost. SOX §404 over IT
general controls, PCI-DSS §10 audit logging, HIPAA §164.312(b) audit
controls, and CA/B Forum Baseline Requirements §5.4.1 audit log
records all presume audit-with-operation atomicity.

Design — Option A (Querier abstraction). The chosen pattern: a shared
repository.Querier interface (subset of *sql.DB and *sql.Tx) plus a
postgres.WithinTx helper that begins a tx, runs fn, commits on nil
error, rolls back on error or panic, and returns the wrapped result.
Repository methods that participate in a service-layer transaction
expose a *WithTx variant taking repository.Querier; the bare methods
remain for stand-alone use. A repository.Transactor abstracts the
"begin tx, run fn, commit/rollback" lifecycle so service-layer code
runs multi-write operations atomically without holding *sql.DB
directly. Option B (UnitOfWork) was considered but adds boilerplate
without behavioral benefit for the current scope. Option C
(context-carried tx) was explicitly rejected — it hides the
transactional boundary from the type system, reproducing the class
of bug we're fixing.

This commit:
- Adds internal/repository/querier.go with the Querier interface
  (compile-time guards that *sql.DB and *sql.Tx satisfy it) and the
  Transactor interface for service-layer use.
- Adds internal/repository/postgres/tx.go with the WithinTx helper
  (begin/fn/commit/rollback with panic recovery) and a transactor
  type that satisfies repository.Transactor.
- Adds CreateWithTx variants on AuditRepository, CertificateRepository
  (Create + Update + CreateVersion), and RevocationRepository.
  Existing bare methods now delegate to the *WithTx variant using
  the package-level *sql.DB so existing call sites are
  behavior-preserving.
- Updates repository/interfaces.go: AuditRepository, CertificateRepository,
  and RevocationRepository declare the new *WithTx methods. Adds an
  atomicity contract doc-comment on AuditRepository pointing at
  WithinTx + the audit blocker.
- Adds AuditService.RecordEventWithTx, mirroring RecordEvent but
  routing through CreateWithTx so the audit row is part of the
  caller's transaction. Same redaction + marshalling contract.
- Refactors three audit-emitting service paths to use Transactor.WithinTx
  when SetTransactor was wired, with a legacy fallback for backward
  compat:
    * CertificateService.Create — cert insert + audit row in one tx.
    * RevocationSvc.RevokeCertificateWithActor — cert status update +
      revocation row + audit row in one tx. The OCSP cache invalidate
      remains best-effort (out of scope per the prompt).
    * RenewalService CompleteServerRenewal — cert version insert +
      cert update + audit row in one tx. Job status update stays
      outside the audit-atomicity scope (job state lives outside
      the operator-facing audit trail).
- Adds SetTransactor on CertificateService, RevocationSvc, and
  RenewalService. cmd/server/main.go wires a single Transactor
  instance shared across all three so all audit-emitting paths run
  their writes in transactions backed by the same *sql.DB handle.
- Updates 5 mock implementations to satisfy the new interface methods:
  mockCertRepo (testutil_test.go), mockCertRepoWithGetError
  (shortlived_test.go), fakeRevocationRepo (crl_cache_test.go),
  intuneE2EAuditRepo (scep_intune_e2e_test.go), and the integration-
  test mocks (lifecycle_test.go: mockCertificateRepository,
  mockAuditRepository, mockRevocationRepository). All *WithTx mocks
  ignore the Querier and delegate to the bare method (mocks have no
  DB; in-memory state is shared regardless of "tx").
- Adds a service-layer test mockTransactor with BeginTxErr and
  CommitErr knobs so the atomic-audit tests can assert error
  propagation through the transactional boundary.
- Adds internal/repository/postgres/tx_test.go: unit-level test that
  WithinTx surfaces "begin tx" wrap when BeginTx fails, and that
  Transactor.WithinTx delegates correctly. Real-Postgres rollback
  semantics are covered by the testcontainers tests in the postgres
  package — sandbox disk pressure prevented adding a sqlmock dep
  for the in-fn / commit-failure unit test, so those scenarios are
  exercised through atomic_audit_test.go using the mockTransactor's
  CommitErr / BeginTxErr fields.
- Adds internal/service/atomic_audit_test.go:
    * TestCertificateService_Create_AtomicWithTx — asserts audit
      insert failure inside the tx surfaces as the operation's error
      (closes the blocker contract).
    * TestCertificateService_Create_LegacyPathLogs — pins the
      backward-compat behavior when SetTransactor isn't wired:
      audit failure is logged-not-failed, matching pre-fix.
    * TestCertificateService_Create_TransactorBeginFailure — BeginTx
      error path: operation fails, no cert insert, no audit insert.
    * TestCertificateService_Create_TransactorCommitFailure —
      Commit error after successful in-fn writes surfaces as the
      operation's error. Real Postgres can fail Commit on
      serialization conflicts; the service must report this.

Out of scope (separate follow-up commits, same shape):
- Issuer CRUD audit atomicity.
- Target CRUD audit atomicity.
- Agent retire (already transactional via RetireAgentWithCascade;
  verified, not changed).
- Renewal-policy CRUD audit atomicity.
- Owner/team/agent-group CRUD audit atomicity.
- Discovery / health-check audit atomicity.

Verified locally:
- gofmt -l . clean
- go vet ./... clean
- staticcheck ./... clean
- golangci-lint run --timeout 5m ./... → 0 issues
- go test -short -count=1 ./internal/service/ green
- go test -short -count=1 ./internal/api/handler/ green
- go test -short -count=1 ./internal/integration/ green
- go test -short -count=1 ./internal/repository/postgres/ green
- go build ./... success

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #3 (Part 3, narrative section).
2026-05-02 00:29:09 +00:00
shankar0123 804a1b05ce awsacmpca: thread ctx through factory + registry — fix CI contextcheck
Follow-up to 590f654 (awsacmpca: replace stub client with AWS SDK v2
implementation). CI's golangci-lint contextcheck rule flagged six
violations in awsacmpca_test.go where mustNew/awsacmpca.New were
called from test functions that had ctx in scope but didn't thread it
through New(). The previous commit used context.Background() inside
New() with the rationale that "the audit allows either threading or
documenting the limitation"; CI made that choice for us.

Threading ctx is the right shape per the audit's stated preference.
The fix cascades from awsacmpca.New through issuerfactory.NewFromConfig
and IssuerRegistry.Rebuild because the contextcheck rule propagates
upward through every caller that has ctx in scope.

This commit:
- Changes awsacmpca.New(config, logger) to
  awsacmpca.New(ctx, config, logger). The ctx is passed to
  buildSDKClient → awsconfig.LoadDefaultConfig so SDK credential chain
  resolution honors caller deadlines (LoadDefaultConfig may probe IMDS
  or remote credential sources). The doc-comment on New explains that
  callers without a useful deadline should pass context.Background()
  and that the SDK has internal credential-resolution timeouts.
- Adds ctx as the first parameter of issuerfactory.NewFromConfig.
  Currently only the AWSACMPCA branch uses ctx (it's threaded into
  awsacmpca.New); the other 11 branches accept ctx without using it.
  This is a contractual change that lets callers thread ctx through
  without contextcheck warnings, even though most issuer constructors
  do no ctx-aware work today.
- Adds ctx as the first parameter of IssuerRegistry.Rebuild. Rebuild
  iterates over configs and calls NewFromConfig per issuer; the same
  ctx flows through every connector instantiation.
- Updates the two production call sites in internal/service:
  - issuer.go:279 (TestIssuer connection test) now passes its
    method-scoped ctx
  - issuer.go:303 (BuildRegistry) now passes its method-scoped ctx
    to Rebuild
- Updates 13 test sites in internal/connector/issuerfactory/factory_test.go
  via a new testCtx() helper that returns context.Background(). Helper
  is dedicated to this file so contextcheck's "you have a ctx in scope,
  pass it" rule doesn't fire on test functions that don't otherwise
  need ctx.
- Updates 6 test sites in internal/service/issuer_registry_test.go
  to pass context.Background() to Rebuild.
- Removes the now-stale "// NewFromConfig has no ctx parameter
  (preserved across all 12 connectors); pass context.Background() ..."
  comment from the awsacmpca branch in factory.go — that workaround
  is no longer the design.

Verified locally:
- gofmt -l . clean
- go vet ./... clean
- staticcheck ./... clean
- golangci-lint run --timeout 5m ./... clean (was failing with 6
  contextcheck issues before the cascade; now 0 issues)
- go test -short -count=1 across all changed packages green

Sandbox couldn't run the existing CI's full make verify due to
disk pressure on /sessions and a virtiofs concurrent-open-file
ceiling on go mod tidy; operator should run `make verify` on
the workstation to confirm.

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #1 (CI follow-up; behavior unchanged from 590f654).
2026-05-01 23:27:25 +00:00
shankar0123 7cb453a336 chore(fmt): repo-wide gofmt -w sweep — close drift surfaced by ci-pipeline-cleanup Phase 4
Mechanical reformat. The new 'gofmt drift' CI step (added in
ci-pipeline-cleanup Phase 4, commit 0f205a8) surfaced 111 files
with accumulated gofmt drift across cmd/, internal/, and deploy/test/.

Each file's diff is gofmt-standard: whitespace adjustments, intra-
group import sorting (alphabetical by import path within blank-line-
separated groups), and struct-tag column alignment. No semantic
changes — verified via 'git diff --ignore-all-space' which shows only
the line-position deltas from import reordering.

The gate stays in place after this commit. Going forward it catches
gofmt drift at PR time.
2026-04-30 22:33:57 +00:00
shankar0123 8637131f80 chore: gofmt fixes across deploy-hardening I new files
Phase 13 verification surfaced gofmt-formatting drift in 6 files
across the bundle's new code:

- internal/api/handler/metrics.go (struct field alignment)
- internal/connector/target/k8ssecret/validate_only_test.go (alignment)
- internal/connector/target/nginx/nginx.go (alignment)
- internal/connector/target/postfix/postfix.go (alignment)
- internal/connector/target/ssh/validate_only_test.go (alignment)
- internal/service/deploy_counters.go (alignment)

Pure mechanical gofmt -w fixes; no behavior changes. CI's
make verify gate (which runs `go fmt ./...`) didn't catch these
because go fmt is more lenient than gofmt -l, but golangci-lint
v2.11.4 + the explicit gofmt step in Phase 13 verification did.

Phase 13 full-matrix verification all green:
- gofmt -l: empty across all bundle-touched files
- go vet ./internal/deploy/... ./internal/connector/target/... ./internal/service/ ./internal/api/handler/ ./cmd/agent/: clean
- golangci-lint v2.11.4 (the version CI runs): 0 issues
- go test -race -count=1 across deploy + nginx + apache + haproxy + agent + service: all green
- INTEGRATION=1 go test -tags integration -run Deploy ./deploy/test/...: 4/4 e2e tests green

Phase 14 next: release prep — Active Focus update, release notes,
Reddit-beat draft, final tag handoff to operator.
2026-04-30 15:33:33 +00:00
shankar0123 135b271197 feat(metrics): per-target-type deploy counters wired into /metrics/prometheus
Phase 10 of the deploy-hardening I master bundle. Mirrors the
production-hardening-II Phase 8 OCSP-counter pattern. Per frozen
decision 0.9, the metric naming convention is
`certctl_deploy_<area>_total` with target_type + sub-label.

internal/service/deploy_counters.go:
- DeployCounters struct with sync.Map of per-target-type buckets
  (apache, nginx, etc.). Lock-free fast path via sync/atomic
  Uint64 counters; LoadOrStore on first tick.
- 8 sub-counters per target-type bucket:
  - attemptsSuccess / attemptsFailure
  - validateFailures (PreCommit returned error)
  - reloadFailures (PostCommit returned error → rollback ran)
  - postVerifyFails (post-deploy TLS handshake failed)
  - rollbackRestored (rollback succeeded)
  - rollbackAlsoFail (operator-actionable escalation)
  - idempotentSkips (SHA-256 match → no-op deploy)
- Snapshot returns []DeploySnapshot for the Prometheus exposer.

internal/service/deploy_counters_test.go:
- 5 tests: zero-state, per-target-type tick isolation, race-detector
  smoke under concurrent ticks, cross-target bucket isolation,
  snapshot-mutation-doesn't-affect-counter.

internal/api/handler/metrics.go:
- New DeployCounterSnapshotter interface (mirrors CounterSnapshotter
  for the OCSP counters but uses the per-target-type tuple shape).
- New DeploySnapshotEntry struct copying the service-layer shape;
  avoids importing the service package directly so the handler
  stays dependency-light.
- New SetDeployCounters setter on MetricsHandler (mirrors
  SetOCSPCounters wiring).
- Prometheus exposer extended with 6 new metric blocks per frozen
  decision 0.9:
  - certctl_deploy_attempts_total{target_type, result}
  - certctl_deploy_validate_failures_total{target_type}
  - certctl_deploy_reload_failures_total{target_type}
  - certctl_deploy_post_verify_failures_total{target_type}
  - certctl_deploy_rollback_total{target_type, outcome}
  - certctl_deploy_idempotent_skip_total{target_type}
- Output sorted by target_type for stable diffs across requests.

The agent-side wire-up (cmd/agent/main.go ticking counters in the
DeployCertificate dispatch site) is intentionally deferred to a
follow-up commit — Phase 10's load-bearing change is the
infrastructure; per-connector tick wiring is a mechanical follow-on.

Build + go vet clean. go test -count=1 green for service +
handler packages.

Phase 11 next: cross-cutting integration tests at deploy/test/.
2026-04-30 15:25:38 +00:00
shankar0123 9e6c57673e test(service): coverage uplift for production hardening II + adjacent helpers (R-CI-extended floor)
CI's R-CI-extended coverage gate failed on 2025-04-30: service-layer
coverage was 68.7% vs the 70% floor. The drag was from new files
(internal/service/ocsp_counters.go, ocsp_response_cache.go,
export_audit_actions.go) that shipped without enough direct tests
to keep the package above the floor.

NEW internal/service/ocsp_counters_test.go (4 tests):
  - TestOCSPCounters_NewIsZero — fresh counter snapshot is all zero
  - TestOCSPCounters_EveryIncTicksItsLabel — table-driven test
    pinning every Inc* method to its label string + the no-cross-
    bleed invariant. Critical for Phase 8 Prometheus exposer
    contract: a typo in either side would silently drop the
    counter from /metrics/prometheus.
  - TestOCSPCounters_SnapshotIsCopy — mutating the returned map
    doesn't affect the underlying counters
  - TestOCSPCounters_ConcurrentTicksRace — race-detector smoke
    against sync/atomic primitives

NEW internal/service/ocsp_response_cache_real_test.go (10 tests):
  - HappyPath_CachesAfterMiss — first fetch live-signs + writes
    cache row; second fetch hits cache
  - CacheWriteFailureIsNonFatal — putErrorRepo simulates disk full;
    response still returned (fail-soft contract)
  - StaleEntryRegenerates — entries with next_update in the past
    trigger re-sign on next fetch
  - InvalidateOnRevoke — pin the load-bearing security wire
  - InvalidateOnRevoke_DeleteFailureSurfacesError — error-path
    coverage for the delete branch
  - CountByIssuer + NilRepoReturnsEmpty
  - CAOperationsSvc.GetOCSPResponseWithNonce_CacheDispatchHit pins
    the nil-nonce → cache dispatch wire
  - CAOperationsSvc.GetOCSPResponseWithNonce_NonceBypassesCache
    pins the nonce-bearing → live-sign bypass wire (cache stays
    empty)
  - RevocationSvc.SetOCSPCacheInvalidator_WireConnects pins the
    setter through to the wired interface

NEW internal/service/coverage_extras_test.go (~12 tests) targets the
0%-coverage chunks adjacent to the bundle's modified files so the
package as a whole stays above the floor:
  - cert-export typed audit emission (Phase 7) round-trip with
    detail-map inspection (has_private_key + actor_kind + cipher pin)
  - PKCS12CipherModernAES256 pinned-value test (drift catches a
    future go-pkcs12 default change)
  - audit.ListAuditEvents + GetAuditEvent (handler-interface methods
    that were at 0%)
  - certificate.ListCertificatesWithFilter (M20 filter delegate)
  - discovery.{ListScans,GetScan,GetDiscoverySummary} (delegates)
  - health_check.{Update,SetNotificationService} delegates + audit
  - est.{deterministicSerial,zeroizeBytes,zeroizeKey} pure helpers
    + the live RSA + ECDSA key-zeroize branches

Sandbox total: 67.6% → 69.9% (+2.3pp). The live keygen branches
in zeroizeKey skip in the sandbox when crypto/rand isn't available
but run on CI, so the CI total should land above the 70% floor with
a small buffer.

Pre-commit verification: go build ./... clean; go test -short
-count=1 green for ./internal/service/.
2026-04-30 06:22:06 +00:00
shankar0123 13b29ca1bd fix(cert-export): satisfy staticcheck ST1022 on PKCS12CipherModernAES256
Production hardening II Phase 11 verification — golangci-lint v2.11.4
flagged the const PKCS12CipherModernAES256 doc comment with ST1022
(comments on exported identifiers should start with the identifier
name). Reformatted to lead with the const name; same content.

Reproduced clean: 0 issues across handler/, service/,
connector/issuer/local/, api/router/, ratelimit/.
2026-04-30 05:22:10 +00:00
shankar0123 8cba794723 feat(cert-export): typed audit-action constants + has_private_key + cipher detail (Phase 7)
Production hardening II Phase 7 — typify the cert-export audit
emission. The pre-Phase-7 audit log carried inline strings
("export_pem" / "export_pkcs12"); this commit adds typed
constants alongside via the split-emit pattern so operators get
both back-compat with existing log analysers AND a stable typed
grep target.

NEW internal/service/export_audit_actions.go:
  - AuditActionCertExportPEM = "cert_export_pem"
  - AuditActionCertExportPEMWithKey = "cert_export_pem_with_key"
    (reserved for future bundle that adds key-bearing export; not
    emitted in V2)
  - AuditActionCertExportPKCS12 = "cert_export_pkcs12"
  - AuditActionCertExportFailed = "cert_export_failed"
  - PKCS12CipherModernAES256 = "AES-256-CBC-PBE2-SHA256" pinned
    string for the cipher detail (drift catches a future go-pkcs12
    default change)

Detail enrichment on both emission sites:
  - has_private_key (bool, V2 always false — cert-only export is
    the only V2 path; key-bearing export deferred to future bundle)
  - actor_kind ("user")
  - cipher (PKCS12 only — pinned to PKCS12CipherModernAES256)

Split-emit pattern: each export emits BOTH the legacy bare action
code AND the typed constant. Mirrors est.go::processEnrollment which
emits both "est_simple_enroll" + "est_simple_enroll_success".
Existing audit-log analysers that match by exact string "export_pem"
keep working; new operator alerts can target the typed constant.

Pre-commit verification: go build ./... clean; go test -short
-count=1 green for service/.
2026-04-30 05:13:15 +00:00
shankar0123 40fd96a416 feat(ocsp): pre-signed response cache + invalidate-on-revoke (Phase 2)
Production hardening II Phase 2 — closes the per-request live-signing
bottleneck for OCSP. Mirrors the existing crl_cache pattern (migration
000019 / internal/service/crl_cache.go) but per (issuer_id, serial_hex)
instead of per-issuer.

LOAD-BEARING SECURITY INVARIANT: a revoked cert MUST NOT continue to
return the stale 'good' cached response after revocation. The
RevocationSvc.RevokeCertificateWithActor flow now calls
OCSPResponseCacheService.InvalidateOnRevoke after a successful revoke
so the next OCSP fetch falls through to live signing and returns the
revoked status. Pinned by TestOCSPCache_InvalidateOnRevoke_NextFetchReturnsRevoked.

NEW migrations/000024_ocsp_response_cache.{up,down}.sql with composite
PK (issuer_id, serial_hex), nullable revocation_reason / revoked_at,
next_update index for the scheduler refresh loop, issuer_id index for
admin observability.

NEW internal/domain/ocsp_response_cache.go::OCSPResponseCacheEntry +
IsStale helper.

NEW internal/repository/postgres/ocsp_response_cache.go implementing
repository.OCSPResponseCacheRepository (Get / Put / Delete /
CountByIssuer). Interface defined in internal/repository/interfaces.go.

NEW internal/service/ocsp_response_cache.go::OCSPResponseCacheService
with read-through facade + sync.Map singleflight + InvalidateOnRevoke.
On cache miss, calls caOperationsSvc.LiveSignOCSPResponse(nil) — the
NEW bypass-cache entry point — to break the cyclic dependency between
cache and CAOps.

REFACTORED internal/service/ca_operations.go:
  - GetOCSPResponseWithNonce now dispatches: nil-nonce + cache wired
    → cacheSvc.Get (cache); nonce != nil OR cache nil → live-sign.
  - LiveSignOCSPResponse is the new exported bypass-cache entry point;
    contains the body of what was previously the GetOCSPResponse-
    With-Nonce path.
  - SetOCSPCacheSvc + new OCSPResponseCacher interface (cyclic-dep
    break + test-injectable).

The cache stores nil-nonce blobs by design. Nonce-bearing requests
always live-sign because re-signing to add a nonce defeats caching;
this is a deliberate tradeoff — most relying parties don't send
nonces (Apple Push, Microsoft Edge SmartScreen, Firefox), and the
minority that do already accept the extra round-trip cost for replay
protection.

WIRED in cmd/server/main.go alongside the existing CRL cache wire:
ocspResponseCacheRepo + ocspResponseCacheService + SetOCSPCacheSvc +
SetOCSPCacheInvalidator. Existing deploys see no behavior change
(cache is consulted but on every cold-start the first fetch lands
through the live-sign + write-back path).

NOT YET WIRED in this commit (deferred to next phase commit to keep
this one shippable):
  - Scheduler ocspCacheRefreshLoop (the warm-on-startup + N-hourly
    refresh loop). The cache works without it; entries just live-sign
    on miss + cache hit thereafter, so cold caches warm up
    organically as relying parties query.
  - Admin observability endpoint /api/v1/admin/ocsp/cache.
  - CERTCTL_OCSP_CACHE_REFRESH_INTERVAL env var.
  These three are the visible-but-not-load-bearing wires; the security
  invariant (no stale-good-after-revoke) is fully shipped here.

7 new tests in internal/service/ocsp_response_cache_test.go pin every
documented invariant, with TestOCSPCache_InvalidateOnRevoke_NextFetch
ReturnsRevoked called out as the load-bearing security test.

Pre-commit verification: go build ./... clean; go test -short -count=1
green for service/ + handler/ + connector/issuer/local/.
2026-04-30 05:03:01 +00:00
shankar0123 3d15a3e5af feat(ocsp): RFC 6960 §4.4.1 nonce extension support — echo client nonce in response, reject malformed
Production hardening II Phase 1.

The OCSP responder previously ignored the request's nonce extension
entirely, leaving relying parties vulnerable to replay attacks. RFC
6960 §4.4.1 defines the OPTIONAL id-pkix-ocsp-nonce extension (OID
1.3.6.1.5.5.7.48.1.2): when present in the request, the responder
MUST echo the same value in the response; when absent, no nonce in
the response (back-compat with relying parties that don't send one).

NEW internal/service/ocsp_nonce.go: ParseOCSPRequestNonce walks raw
DER (golang.org/x/crypto/ocsp.Request doesn't expose the request's
extensions field — the library only exposes IssuerNameHash +
IssuerKeyHash + SerialNumber). Returns one of three states:
  - (nil, false, nil) — no nonce extension in request
  - (nonce, true, nil) — well-formed nonce, ≤ MaxOCSPNonceLength (32)
  - (nil, false, ErrOCSPNonceMalformed) — empty or oversized

NEW internal/service/ocsp_counters.go: sync/atomic counter table for
OCSP request lifecycle (request_get/post, request_success/invalid,
nonce_echoed, nonce_malformed, rate_limited, ...). Mirrors the EST/
SCEP counter pattern; Phase 8 wires these into /metrics/prometheus.

CertSrv types extended:
  - internal/connector/issuer/interface.go::OCSPSignRequest gains
    Nonce []byte field.
  - internal/service/renewal.go::OCSPSignRequest (the service-layer
    duplicate used by ca_operations.go) gains the same field.
  - internal/service/issuer_adapter.go bridges the two.

Service path: CAOperationsSvc.GetOCSPResponseWithNonce(ctx, issuerID,
serialHex, nonce) is the new entry point that plumbs the nonce
through every signing site (good / revoked / unknown / short-lived).
The legacy GetOCSPResponse becomes a nil-nonce wrapper for back-
compat — every existing caller (tests, the GET handler) sees no
behavior change.

CertificateService gains the same WithNonce variant; the handler
interface adds it to the contract. MockCertificateService in tests
extended with the new method (delegates to the legacy fn when no
override is set, so existing tests that don't care about the nonce
keep working).

Local issuer's SignOCSPResponse appends the id-pkix-ocsp-nonce
extension (non-Critical per RFC 6960 §4.4) to the response template's
ExtraExtensions when req.Nonce != nil. The extnValue is the nonce
bytes wrapped in an OCTET STRING per RFC 6960 §4.4.1.

POST OCSP handler (HandleOCSPPost):
  - After ocsp.ParseRequest succeeds, calls ParseOCSPRequestNonce on
    the raw body to extract the optional nonce.
  - On ErrOCSPNonceMalformed (empty or > 32 bytes): writes an
    'unauthorized' OCSP response (status 6 per RFC 6960 §2.3) using
    the canonical ocsp.UnauthorizedErrorResponse from x/crypto/ocsp.
    Does NOT echo malicious bytes back.
  - On well-formed nonce: passes it through GetOCSPResponseWithNonce.
  - On no nonce: nil passed through; back-compat preserved.

GET OCSP handler unchanged — the GET form has no body to carry a
nonce extension.

6 new tests in internal/service/ocsp_nonce_test.go pin every
documented failure mode + the 32-byte boundary. The test fixture
builds an OCSPRequest via golang.org/x/crypto/ocsp.CreateRequest then
splices in a [2] EXPLICIT Extensions element by hand (the library
doesn't expose extension construction either).

Pre-commit verification: gofmt clean, go vet clean across affected
packages, go test -short -count=1 green for service/ + handler/ +
connector/issuer/local/. No new env vars introduced (Phase 1 is
always-on per RFC; no operator opt-out).
2026-04-30 04:55:06 +00:00
shankar0123 5834e5b866 fix(est): plumb context through ESTService.ReloadTrust to satisfy contextcheck
CI golangci-lint v2.11.4 flagged internal/api/handler/admin_est.go:178:
the AdminESTServiceImpl.ReloadTrust method took ctx context.Context but
called svc.ReloadTrust() with no context, then the underlying
ESTService.ReloadTrust used context.Background() internally for the
audit RecordEvent call. That's the contextcheck linter's textbook
'context discarded at boundary' violation.

Fix: change ESTService.ReloadTrust signature to ReloadTrust(ctx
context.Context) and forward the caller-supplied ctx into
auditService.RecordEvent. AdminESTServiceImpl.ReloadTrust now passes
its received ctx through. The HTTP handler already forwards
r.Context() one layer up, so the request-scoped trace identifiers now
flow end-to-end into the audit row instead of being severed at the
service boundary.

Verified locally with golangci-lint v2.11.4 (the same version CI runs)
against ./internal/api/handler/... ./internal/service/... — '0
issues.' All cmd/* binaries build clean, go test -short -count=1
green for both packages.
2026-04-30 01:59:04 +00:00
shankar0123 5a682db8e2 EST RFC 7030 hardening master bundle Phases 10-11: libest sidecar e2e
+ Cisco IOS quirk fixtures + ManagedCertificate.Source provenance +
EST bulk-revoke endpoint + 13 typed audit action codes.

Phase 10.1 — libest reference-client sidecar:
- deploy/test/libest/Dockerfile: multi-stage Debian-bookworm-slim
  build of Cisco's libest v3.2.0-2 from source (autoconf/automake/
  libtool + libcurl4-openssl-dev + libssl-dev). Runtime stage
  carries only estclient + bash + openssl + ca-certificates so the
  exec surface stays small + predictable.
- docker-compose.test.yml libest-client entry (profiles: [est-e2e])
  with bind mounts for /config/est (test workspace) + /config/certs
  (certctl CA bundle for TLS pinning); IP 10.30.50.9 (10.30.50.8
  was already taken by certctl-agent).
- deploy/test/est/.gitkeep keeps the bind-mount target tracked.

Phase 10.2 — 5 integration tests (//go:build integration) in
deploy/test/est_e2e_test.go:
- TestEST_LibESTClient_Enrollment_Integration (cacerts → simpleenroll
  → cert-shape assertion)
- TestEST_LibESTClient_MTLSEnrollment_Integration (mTLS sibling-route
  cert auth; skip when bootstrap cert absent)
- TestEST_LibESTClient_ServerKeygen_Integration (RFC 7030 §4.4
  multipart; skip when profile gate disabled)
- TestEST_LibESTClient_RateLimited_Integration (4th enroll trips
  per-principal cap, asserts 429-shaped error)
- TestEST_LibESTClient_ChannelBinding_Integration (libest
  --tls-exporter; skip when libest build lacks the flag).
- requireESTSidecar guard skips the suite when the operator forgot
  --profile est-e2e; helpful error message includes the exact
  command to bring the sidecar up.

Phase 10.3 — Cisco IOS quirk fixtures + 3 unit tests in
internal/api/handler/cisco_ios_quirks_test.go:
- testdata/cisco_ios_15x_pem_csr.txt: PEM body sent with
  Content-Type application/x-pem-file. Handler dispatches on
  body-prefix not Content-Type — accepts cleanly.
- testdata/cisco_ios_16x_trailing_newline_csr.txt: extra trailing
  newlines after base64 body. strings.TrimSpace tolerates.
- testdata/cisco_ios_crlf_b64_csr.txt: CRLF-wrapped base64.
  base64.StdEncoding handles CRLF + LF identically.

Phase 11.1 — ManagedCertificate.Source provenance:
- New domain.CertificateSource enum (Unspecified/EST/SCEP/API/Agent).
- Migration 000023_managed_certificates_source.up.sql adds source
  TEXT NOT NULL DEFAULT '' so existing rows scan as
  CertificateSourceUnspecified — back-compat: bulk-revoke filter
  treats empty as "any source".
- Postgres repo Insert/Update/scan paths all wire the new column.

Phase 11.2 — EST bulk-revoke endpoint:
- BulkRevocationCriteria.Source field (Source-only requests rejected
  as too broad — must accompany at least one narrower criterion).
- service.bulk_revocation.resolveCertificates post-filter by Source
  (empty=any, no SQL change so existing CertificateFilter callers
  unaffected).
- New BulkRevocationHandler.BulkRevokeEST method pins Source=EST +
  dispatches; new route POST /api/v1/est/certificates/bulk-revoke
  (M-008 admin-gated). openapi.yaml documented + parity-guard green.

Phase 11.3 — 13 typed audit action codes in
internal/service/est_audit_actions.go:
- est_simple_enroll_success / _failed
- est_simple_reenroll_success / _failed
- est_server_keygen_success / _failed
- est_auth_failed_basic / _mtls / _channel_binding
- est_rate_limited
- est_csr_policy_violation
- est_bulk_revoke
- est_trust_anchor_reloaded
- ESTService.processEnrollment + SimpleServerKeygen + ReloadTrust
  split-emit BOTH the legacy bare action codes (back-compat for the
  GUI activity-tab chip filters that match by exact string +
  existing audit-log analysers) AND the new typed _success / _failed
  variants (operator grep target + per-failure-mode counter).

Tests:
- internal/api/handler/bulk_revocation_est_test.go — 5 cases
  (admin-true happy path pins Source=EST + non-admin 403 +
  empty-criteria 400 + invalid-reason 400 + method-not-allowed).
- internal/service/est_audit_actions_test.go — 5 cases (SimpleEnroll
  legacy+typed emission / SimpleReEnroll typed / IssuerError
  typed-failed / PolicyViolation triple-emit /
  unique-string invariant).

Pre-commit verification (sandbox): gofmt clean, go vet clean
(excluding repository/postgres testcontainers limit), staticcheck
clean across api/handler/api/router/domain/service/deploy/test,
go test -short -count=1 green for every non-postgres Go package +
integration build (`go build -tags integration ./deploy/test/...`)
clean. G-3 docs-drift guard reproduced locally clean (Phases 10-11
added zero new env vars).

Spec preserved at cowork/est-rfc7030-hardening-prompt.md. Phases
12-13 (docs/est.md + WiFi/802.1X / IoT bootstrap / FreeRADIUS
recipes; release prep + tag) remain — post-2.1.0 work.
2026-04-30 00:52:43 +00:00
shankar0123 43075a1b5c EST RFC 7030 hardening master bundle Phases 5-7: end-to-end serverkeygen
+ profile-driven csrattrs + admin observability with per-status
counters + reload-trust endpoint.

Phase 5 — RFC 7030 §4.4 server-driven key generation:
- internal/pkcs7/envelopeddata_builder.go is the inverse of the
  existing parser/decryptor: AES-256-CBC content cipher + RSA PKCS#1
  v1.5 keyTrans + per-call random IV. Round-trip pinned in test
  (BuildEnvelopedData → ParseEnvelopedData → Decrypt returns the
  original plaintext byte-for-byte).
- ESTService.SimpleServerKeygen runs the full §4.4 flow: parse client
  CSR → require RSA pubkey for keyTrans → resolve per-profile
  algorithm (RSA-2048 default; honors AllowedKeyAlgorithms) → in-
  memory keygen → re-build CSR with server pubkey → run existing
  issuer pipeline → marshal PKCS#8 → CMS-EnvelopedData wrap to a
  synthetic recipient cert wrapping the device's CSR-supplied pubkey
  → zeroize plaintext + PKCS#8 bytes → return CertPEM + ChainPEM
  + EncryptedKey. Typed sentinels ErrServerKeygenRequiresKey-
  Encipherment / ErrServerKeygenUnsupportedAlgorithm /
  ErrServerKeygenDisabled.
- ESTHandler.ServerKeygen + ServerKeygenMTLS emit RFC 7030 §4.4.2
  multipart/mixed with random per-response boundary; per-profile
  SetServerKeygenEnabled gate returns 404 when off (defense in depth
  even if the route was registered).
- New routes POST /.well-known/est/[<PathID>/]serverkeygen +
  /.well-known/est-mtls/<PathID>/serverkeygen; openapi.yaml +
  openapi-parity guard updated.

Phase 6 — Real csrattrs implementation:
- New CertificateProfile.RequiredCSRAttributes []string + migration
  000022_certificate_profiles_csrattrs.up.sql. The migration also
  lands the previously-unwired must_staple column (closes the 5.6
  follow-up loop where the field shipped at the domain + service
  layer but the postgres scan/insert/update never persisted it).
- domain.EKUStringToOID + AttributeStringToOID lookup tables: id-kp-*
  EKUs (RFC 5280 §4.2.1.12) + RFC 5280 DN attributes + RFC 2985
  PKCS#10 attributes + Microsoft Intune device-serial OID.
- ESTService.GetCSRAttrs replaces the v2.0.x nil/204 stub with a
  profile-derived SEQUENCE OF OID ASN.1 marshal. Unknown EKU /
  attribute strings dropped + warning-logged so a typo doesn't take
  down the entire endpoint.

Phase 7 — Admin observability + counters + reload-trust:
- internal/service/est_counters.go: estCounterTab (sync/atomic; 12
  named labels) + ESTStatsSnapshot per-profile shape +
  ESTService.Stats(now) zero-allocation accessor + ReloadTrust()
  SIGHUP-equivalent + SetESTAdminMetadata setter.
- Counter ticks wired into processEnrollment + SimpleServerKeygen at
  every success/failure leg.
- internal/api/handler/admin_est.go mirrors AdminSCEPIntune verbatim:
  Profiles + ReloadTrust handlers + AdminESTServiceImpl. Both
  endpoints admin-gated (M-008 triplet pinned + admin_est.go added
  to AdminGatedHandlers).
- New routes GET /api/v1/admin/est/profiles + POST /api/v1/admin/
  est/reload-trust; openapi.yaml documented; openapi-parity guard
  reproduced clean.
- cmd/server/main.go grows estServices map populated by the per-
  profile EST loop + handed to AdminEST. New MTLSTrust() +
  HasMTLSTrust() accessors on ESTHandler so main.go can pull the
  trust holder for the admin-metadata wire-up.
- Per-profile counter isolation regression test
  (internal/service/est_profile_counter_isolation_test.go) proves
  a future shared-counter refactor would fail at compile-time
  pointer-identity check.

Pre-commit verification (sandbox): gofmt clean, go vet clean
(excluding repository/postgres which the sandbox can't build —
disk-space testcontainers download), staticcheck clean across
cms/trustanchor/api/handler/api/router/scep/intune/ratelimit/
service/pkcs7/domain/cmd/server, go test -short -count=1 green
for every non-postgres package. G-3 docs-drift guard reproduced
locally clean (Phases 5-7 added zero new env vars; Phase 1
already documented per-profile SERVER_KEYGEN_ENABLED).

Spec preserved at cowork/est-rfc7030-hardening-prompt.md. Phases
8-13 (GUI ESTAdminPage / CLI+MCP / libest e2e / bulk revocation /
docs/est.md / release prep) remain — post-2.1.0 work.
2026-04-29 23:57:45 +00:00
shankar0123 530593507b fix(scep-intune): close 11 audit gaps from 2026-04-29 pre-tag review
Closes the eleven gaps identified in the pre-v2.1.0 audit of the SCEP
RFC 8894 + Intune master bundle (cowork/scep-bundle-gap-closure-prompt.md).
Constitutional rule from cowork/CLAUDE.md::Operating Rules — 'Always
take the complete path, not the easy path' — drove this closure: each
gap was a load-bearing wire that crossed multiple layers (config →
validator → service wire-up → tests → docs) and shipping the bundle
without them would have produced lying-field footguns where operator-
visible config options stored values without affecting behavior.

WHAT LANDS:

Phase A — Clock-skew tolerance (master prompt §15 hazard closure)
  internal/scep/intune/challenge.go: ValidateChallenge migrated from
  positional args to ValidateOptions{} struct; new ClockSkewTolerance
  field with default 0 (strict). 24 call sites updated mechanically.
  Asymmetric application: now+tolerance >= iat AND now-tolerance < exp.
  internal/config/config.go: SCEPIntuneProfileConfig.ClockSkewTolerance
  default 60s + Validate() refusal when >= ChallengeValidity.
  cmd/server/main.go: SetIntuneIntegration signature extended;
  per-profile env-var loader honors CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CLOCK_SKEW_TOLERANCE.
  internal/service/scep.go: intuneClockSkew field + IntuneStatsSnapshot
  surfaces clock_skew_tolerance_ns. web/src/api/types.ts mirrors.
  4 new tests in challenge_test.go covering accept-within-tolerance,
  reject-beyond-tolerance, accept-expired-within-tolerance,
  negative-treated-as-zero defensive normalization.
  docs/scep-intune.md updated with the new env var + time-bounds rule.

Phase B — unknown-version-rejected golden test
  internal/scep/intune/golden_helper_test.go: goldenUnknownVersionPayload
  helper + signGoldenChallengeAny generic signer.
  challenge_golden_test.go: TestGoldenChallenge_UnknownVersionRejected
  uses an in-process ECDSA fixture (the on-disk PEM was generated with
  a Go-stdlib version that produces different ecdsa.GenerateKey bytes
  from the current call). TestRegenerateGoldenFixtures emits the new
  unknown_version fixture file too.

Phase C — Two named Intune e2e tests
  internal/api/handler/scep_intune_e2e_test.go:
    TestSCEPIntuneEnrollment_RateLimited_E2E (cap=2 + 3 attempts; 3rd
    returns FAILURE+badRequest with rate_limited counter ticked)
    TestSCEPIntuneEnrollment_TrustAnchorSIGHUPReload_E2E (rotate
    on-disk PEM + holder.Reload(); old-key challenge fails with
    badMessageCheck; signature_invalid counter ticked)
  intuneE2EFixture struct extended with trustHolder + trustPath fields
  so tests can rotate.

Phase D — Four new ChromeOS hermetic tests (10 total now)
  internal/api/handler/scep_chromeos_test.go:
    _RAKeyMismatch — PKIMessage encrypted to wrong RA cert; handler
      rejects without reaching service.
    _3DESBackwardCompat — RFC 8894 §3.5.2 legacy fallback verified.
    _RSACSR + _ECDSACSR — explicit matrix-pair pinning.
  buildTestECDSACSR helper for ECDSA P-256 CSR construction;
  tripleDESCBCEncrypt mirrors aesCBCEncrypt for 3DES-CBC;
  assertChromeOSPositiveCertRep shared assertion.

Phase E — Per-profile counter isolation test
  internal/api/handler/scep_profile_counter_isolation_test.go:
    TestSCEPHandler_PerProfileIntuneCountersIsolated wires two
    SCEPService instances + drives distinct PKIMessages + asserts
    counter isolation. Guards against a future cmd/server/main.go
    refactor that shares a *intuneCounterTab across profiles.
  buildPerProfileIntuneFixture parameterized helper.

Phase F — Server-boot regression tests
  cmd/server/preflight_scep_intune_test.go: 3 named tests covering
  disabled-backward-compat, broken-config-with-PathID, expired-cert
  refusal. preflightSCEPIntuneTrustAnchor signature extended with
  pathID arg so error messages carry PathID= for operator log-grep.

Phase G — docs/connectors.md
  Four new subsections under §EST/SCEP Integration: multi-profile
  dispatch + mTLS sibling route + Intune Connector dispatcher + SCEP
  probe in network scanner. Each has a one-paragraph operator
  explanation + an env-var or endpoint table.

Phase H — Coverage uplift
  internal/service/scep_probe_persist_test.go: 5 unit tests on
  persistProbeResult (nil-safe + nil-repo-safe + repo-error swallow +
  nil-logger guard) + ListRecentSCEPProbes (empty-slice-not-nil + repo
  pass-through) + describeCertAlgorithm (RSA/ECDSA/QF1008-nil-curve
  defensive branch/Ed25519/DSA/empty). CI gates (service ≥70, handler
  ≥75) PASS at 70.9% / 79.3%.

Phase I — deploy/test integration variant
  deploy/test/scep_intune_e2e_test.go (//go:build integration):
    TestSCEPIntuneEnrollment_Integration + _RateLimited_Integration
    against the live docker-compose certctl container. Skip-when-
    stack-missing semantics so sandbox + CI both work.
  deploy/docker-compose.test.yml: new e2eintune SCEP profile env
  vars + bind-mount of deploy/test/fixtures/.
  deploy/test/fixtures/README.md: documents the deterministic trust
  anchor regeneration recipe.

VERIFICATION (sandbox):
  gofmt -d        — clean for all changed files
  staticcheck     — clean for intune + handler + config + service +
                    cmd/server packages
  go vet          — clean for the same packages
  go test -short  — green for intune (95.3% cov), service (70.9%),
                    handler (79.3%), config (94.0%), cmd/server (boot
                    path; my preflight tests cover the directly-
                    testable function), pkcs7 (80.5% informational)

DEFERRED (per closure prompt §7 out-of-scope):
  - V3-Pro Conditional Access gating + Microsoft Graph integration
  - Standalone certctl-scan CLI binary
  - OCSP rate-limiting, OCSP stapling, delta CRLs

Spec preserved at cowork/scep-bundle-gap-closure-prompt.md;
journal at cowork/scep-rfc8894-intune/progress.md (audit-closure
section appended).
2026-04-29 20:28:53 +00:00
shankar0123 84fac19f98 fix(scep-probe): satisfy staticcheck QF1008 in describeCertAlgorithm
CI flagged QF1008 on the chained selector pub.Curve.Params() — the
linter wants the promoted-method form pub.Params() (Curve is embedded
in ecdsa.PublicKey, so Params is reachable via promotion). Restructure
the nil check so the embedded interface still gets validated before the
promoted call, then invoke pub.Params() once and reuse the result.

Verification:
  * gofmt clean
  * staticcheck on internal/service/...: clean
  * 6/6 TestProbeSCEP_* tests still pass
2026-04-29 19:00:05 +00:00
shankar0123 506cff137d feat(scep): SCEP probe in network scanner for fleet-readiness assessment
Phase 11.5 of the SCEP RFC 8894 + Intune master bundle. Adds an
operator-facing SCEP probe that issues GetCACaps + GetCACert against
an arbitrary SCEP server URL and returns a structured posture snapshot
(reachable + advertised caps + RFC 8894 / AES / POST / Renewal /
SHA-256 / SHA-512 support flags + CA cert subject + issuer + NotBefore
+ NotAfter + days-to-expiry + algorithm + chain length).

Two operator use cases per the master prompt:

  1. Pre-migration assessment — probe an existing EJBCA / NDES SCEP
     server before switching to certctl to see what capabilities it
     advertises and what the CA cert looks like.
  2. Compliance posture audits — periodic ad-hoc probes against the
     operator's own SCEP servers to flag drift.

Capability-only — does NOT POST a CSR per the spec (would consume slot
allocations on the target server + create audit noise). Standalone CLI
binary explicitly out of scope (per the master prompt §11.5.6 and the
operator's confirmation): the probe code lands inside certctl; a
future thin Cobra wrapper is a separate decision.

Backend (six new + one extended file):

  * internal/domain/network_scan.go — new SCEPProbeResult struct with
    every probe field documented for the GUI's display layer.

  * migrations/000021_scep_probe_results.up.sql + .down.sql — new
    scep_probe_results table with TEXT id, target_url, all probe
    flags, CA cert metadata, probed_at, probe_duration_ms, error.
    Two indexes: idx_scep_probe_results_probed_at (DESC) for the
    'recent probes' GUI query, idx_scep_probe_results_target_url
    (target_url, probed_at DESC) for the future per-URL history view.

  * internal/repository/interfaces.go — new SCEPProbeResultRepository
    interface (Insert + ListRecent).

  * internal/repository/postgres/scep_probe_results.go — Postgres
    implementation. ListRecent clamps limit to [1, 200]; on read
    re-derives ca_cert_days_to_expiry against the query-time wall
    clock so 'X days remaining' stays fresh.

  * internal/service/scep_probe.go — ProbeSCEP(ctx, url) on
    NetworkScanService. Validation order:
      1. Up-front URL validation via validation.ValidateSafeURL
         (defaults to validation.ValidateSafeURL but injectable for
         tests via the new scepValidateURL field on the service).
      2. Dial-time SSRF re-check via SafeHTTPDialContext on the
         http.Transport (defends against DNS rebinding).
      3. GET ?operation=GetCACaps + GET ?operation=GetCACert.
         GetCACert handles three response shapes: PKCS#7 SignedData
         certs-only envelope (multi-cert), raw DER (single-cert),
         and PEM-wrapped DER (non-conforming servers).
    Times out at 30s; uses a 1MB body cap for DoS defense; wraps
    the result + persists via the repo (nil-safe) before returning.
    describeCertAlgorithm helper returns 'RSA-N' / 'ECDSA-curve' /
    'Ed25519' / 'DSA' for the GUI's algorithm column.

  * internal/service/network_scan.go — added scepProbeRepo +
    scepHTTPClient + scepValidateURL + scepIDFn + nowFn fields;
    SetSCEPProbeRepo wires the repo at startup.

  * internal/api/handler/network_scan.go — extended NetworkScanService
    interface with ProbeSCEP + ListRecentSCEPProbes; added two new
    HTTP handlers:
      POST /api/v1/network-scan/scep-probe   (body {url})
      GET  /api/v1/network-scan/scep-probes  (recent history)
    Synchronous probe; HTTP 200 with the result body for both success
    and reachable-but-failed cases (so the GUI can render the failure
    tone with the operator-actionable error message).

  * internal/api/router/router.go — registered the two routes inline
    after the existing network-scan target endpoints.

  * api/openapi.yaml — documented both endpoints (operationId
    probeSCEP + listSCEPProbes) with full schema + response codes.

  * cmd/server/main.go — wires the new SCEPProbeResultRepository
    onto the network scan service via SetSCEPProbeRepo right after
    the existing NewNetworkScanService construction.

Backend tests (6 new — exit-criteria-named per the master prompt):

  * TestProbeSCEP_AdvertisesAllCaps — happy path, full RFC 8894
    capability set, ECDSA P-256 CA cert, 365-day expiry.
  * TestProbeSCEP_MissingSCEPStandard — pre-RFC-8894 server (only
    POSTPKIOperation + SHA-1 + DES3); SupportsRFC8894 = false.
  * TestProbeSCEP_GetCACertExpired — CA cert NotAfter 30d in the
    past; CACertExpired = true.
  * TestProbeSCEP_Unreachable — connect to TCP port 1; probe
    returns Reachable=false + non-empty Error.
  * TestProbeSCEP_RejectsReservedIP — http://169.254.169.254/scep
    (EC2 metadata literal) rejected by the up-front
    validation.ValidateSafeURL gate; result captures the error
    without ever issuing the HTTP call.
  * TestProbeSCEP_PEMWrappedCert — server returns PEM instead of
    raw DER for GetCACert; the fallback parse path handles it.

Frontend (one extended file + types/client):

  * web/src/api/types.ts — SCEPProbeResult + SCEPProbesResponse.
  * web/src/api/client.ts — probeSCEPServer + listSCEPProbes
    helpers.
  * web/src/pages/NetworkScanPage.tsx — new SCEPProbeSection
    component + ProbeResultPanel (with capability badges + CA cert
    details panel + raw caps line) + SCEPProbeHistoryTable. Form
    rejects empty URL with inline error before calling the API.
    Reload mutation goes through useTrackedMutation with explicit
    invalidates: [['scep-probes']] (M-009 contract).

Frontend tests (5 new + 0 regressions):

  * Scep probe section header + form renders.
  * Empty URL is rejected with inline error and never calls the
    probe endpoint.
  * Successful probe renders capability badges + CA cert subject
    + days-remaining inline panel.
  * Probe-level errors are surfaced in the inline panel (no result
    panel rendered).
  * Recent-probes history table renders one row per probe.
  * (Existing 2 NetworkScanPage XSS-hardening tests stub the new
    listSCEPProbes endpoint to an empty list so they still pass.)

Verification:
  * gofmt clean on touched files
  * go vet ./... clean
  * staticcheck on service+handler+router+repository+cmd-server clean
  * go test -short across service+handler+router+repository+cmd-server
    + integration: all green (existing + 6 new probe tests pass)
  * Frontend tsc --noEmit clean
  * Vitest: 7/7 NetworkScanPage tests pass (2 existing XSS + 5 new
    probe section)
  * G-3 docs-drift CI guard reproduced locally clean (no new env vars)
  * M-009 hard-zero useMutation guard clean (probe mutation goes
    through useTrackedMutation)
  * openapi-parity guard satisfied (both new routes documented)
  * The mockNetworkScanService in handler + integration packages
    extended with stub Probe methods; targeted coverage stays in
    scep_probe_test.go.

Out of scope (per master prompt §11.5.6 + operator confirmation):
  * Standalone certctl-scan CLI binary — separate decision, ~1d of
    follow-up work when/if shipped.

Refs: cowork/scep-rfc8894-intune-master-prompt.md::Phase 11.5
      cowork/scep-rfc8894-intune/progress.md
2026-04-29 18:51:57 +00:00
shankar0123 0be889ff1d refactor(scep-gui): rebrand SCEP admin surface to per-profile tabbed interface (Profiles + Intune + Recent Activity)
Phase 9 follow-up to the SCEP RFC 8894 + Intune master bundle. The
Phase 9.4 GUI shipped 'SCEP Intune Monitoring' at /scep/intune, which
made the per-profile observability surface look Intune-only — operators
running EJBCA + Jamf would never click that nav link expecting per-
profile RA cert + mTLS observability. The page is per-profile keyed
under the hood; this commit rebrands + restructures so the surface
matches what operators actually need.

Spec: cowork/scep-gui-restructure-prompt.md.

User-visible change:

  - Nav link renamed: 'SCEP Intune' → 'SCEP Admin'.
  - Route: /scep is the new canonical path; /scep/intune kept as a
    backward-compat alias that lands directly on the Intune tab.
  - Page header: 'SCEP Administration'.
  - Three tabs:
      * Profiles (default) — per-profile lean cards with RA cert
        expiry countdown, mTLS sibling-route status badge, Intune
        enabled/disabled badge, challenge-password-set indicator.
        'View Intune details →' link on Intune-enabled cards
        deep-links into the Intune tab.
      * Intune Monitoring — the existing Phase 9.4 deep-dive
        (per-status counters, trust anchor expiry, recent failures
        table, reload-trust button + confirmation modal).
      * Recent Activity — full SCEP audit log filter merging all
        four action codes (scep_pkcsreq + scep_renewalreq +
        scep_pkcsreq_intune + scep_renewalreq_intune); chip filters
        for All / Initial / Renewal / Intune / Static.

Backend:

  * internal/service/scep.go — new SCEPProfileStatsSnapshot type +
    IntuneSection sub-block + ProfileStats(now) accessor. Adds
    raCertSubject/raCertNotBefore/raCertNotAfter + mtlsEnabled +
    mtlsTrustBundlePath fields with SetRACert + SetMTLSConfig setters.
    Existing IntuneStatsSnapshot + IntuneStats(now) preserved
    UNCHANGED for /admin/scep/intune/stats backward compat (the
    JSON shape stays byte-stable for external consumers — the
    aliasing approach the prompt initially suggested doesn't work
    because the new shape nests Intune while the old one is flat).
    ChallengePasswordSet is derived from challengePassword != ''
    (the secret value itself is never surfaced).

  * internal/api/handler/admin_scep_intune.go — new Profiles handler
    method on AdminSCEPIntuneHandler with the same M-008 admin gate.
    AdminSCEPIntuneServiceImpl extended (in place; same
    map[string]*service.SCEPService) to satisfy the new
    AdminSCEPProfileService interface. Single handler file gets the
    third method so the M-008 pin entry count stays steady (no new
    file, no new triplet of admin-gate test files — just three new
    Profiles tests inside the existing test file).

  * internal/api/router/router.go — one new route
    'GET /api/v1/admin/scep/profiles' registered to
    reg.AdminSCEPIntune.Profiles. HandlerRegistry unchanged.

  * api/openapi.yaml — new operation 'listSCEPProfiles' documenting
    the request body / response shape / error mapping. Existing
    Intune entries unchanged.

  * cmd/server/main.go — per-profile loop now calls
    scepService.SetMTLSConfig(profile.MTLSEnabled,
    profile.MTLSClientCATrustBundlePath) right after SetPathID, and
    scepService.SetRACert(raCert) right after loadSCEPRAPair returns
    the leaf cert. Both setters are nil-safe.

  * internal/api/handler/m008_admin_gate_test.go — extended the
    existing admin_scep_intune.go entry's justification to mention
    the third endpoint. No new map entry needed (file already
    listed).

Backend tests (8 new):

  * TestAdminSCEPProfiles_NonAdmin_Returns403
  * TestAdminSCEPProfiles_AdminExplicitFalse_Returns403
  * TestAdminSCEPProfiles_AdminPermitted_ForwardsActor — also pins
    that Intune-enabled profiles emit an 'intune' sub-block while
    Intune-disabled profiles OMIT it.
  * TestAdminSCEPProfiles_RejectsNonGetMethod
  * TestAdminSCEPProfiles_PropagatesServiceError
  * TestAdminSCEPProfilesServiceImpl_NilMapReturnsEmpty
  * (existing 16 Phase 9 admin tests still pass — backward-compat
    preserved)

Frontend:

  * web/src/api/types.ts — new SCEPProfileStatsSnapshot +
    IntuneSection + SCEPProfilesResponse types. Existing
    IntuneStatsSnapshot et al unchanged.
  * web/src/api/client.ts — new getAdminSCEPProfiles helper.
  * web/src/pages/SCEPAdminPage.tsx — full rewrite as the tabbed
    surface. Reuses the existing ConfirmReloadModal and Intune
    deep-dive card components verbatim; adds ProfileSummaryCard
    (lean card for the Profiles tab) and ActivityTab. URL state
    sync via useSearchParams so deep links survive reloads + browser
    back/forward. The legacy /scep/intune route alias defaults the
    activeTab to 'intune' on mount.
  * web/src/main.tsx — new <Route path='scep' /> + preserved
    <Route path='scep/intune' /> alias. Both render SCEPAdminPage.
  * web/src/components/Layout.tsx — nav link rebranded:
    label 'SCEP Intune' → 'SCEP Admin', to '/scep/intune' → '/scep'.

Frontend tests (20 — full rebuild):

  * Admin gate (non-admin sees gated banner + zero admin API calls)
  * Profiles tab default + Intune tab tabswitch + ?tab=intune deep
    link + legacy /scep/intune alias all land on Intune
  * Profiles tab status badges (Intune + mTLS + challenge-set)
    reflect each profile's flags
  * RA cert expiry tone bands (good ≥30d / warn 7-30d / bad <7d /
    EXPIRED) verified across three fixture profiles
  * 'View Intune details →' only renders for Intune-enabled
    profiles AND switches tabs on click
  * Empty-state banner when no profiles configured
  * Intune tab counters render with the existing Phase 9 deep-dive
    shape; reload modal Open/Confirm/Cancel/Error paths all pinned
  * Recent Activity tab merges all four SCEP audit actions across
    four parallel useQuery calls; filter chips
    (all/initial/renewal/intune/static) narrow correctly
  * Error path surfaces ErrorState on the active tab

Docs:

  * docs/scep-intune.md — Operational monitoring section heading
    expanded to '(SCEP Administration → Intune Monitoring tab)'.
    Page-surface description rewritten for the tabbed shape;
    admin-endpoints list extended with the new /admin/scep/profiles
    entry.
  * docs/architecture.md — Microsoft Intune Connector trust anchor
    subsection updated to reference the Intune Monitoring tab inside
    the SCEP Administration page + lists all three admin endpoints.
  * docs/legacy-est-scep.md — forward-ref expanded with a parallel
    sentence for the per-profile observability surface (independent
    of Intune).
  * README.md — Enrollment Protocols bullet for Intune updated to
    'admin GUI SCEP Administration page at /scep' with the three
    tabs called out.

Verification:
  * gofmt clean on touched files
  * go vet ./... clean
  * staticcheck on intune+service+handler+router+cmd-server clean
  * go test -short across intune+service+handler+router+cmd-server:
    all green (existing Phase 9 tests + new Profiles tests)
  * Frontend tsc --noEmit clean
  * Vitest: 20/20 SCEPAdminPage tests + 3/3 sibling AuditPage tests
    pass
  * G-3 docs-drift CI guard reproduced locally: clean (no new env
    vars; existing CERTCTL_SCEP_ allowlist prefix covers everything)
  * M-009 hard-zero useMutation guard reproduced locally: clean
    (the existing reload mutation already used useTrackedMutation
    from the Phase 9 follow-up commit 28e277a)
  * openapi-parity test green (new GET /api/v1/admin/scep/profiles
    operation documented)
  * M-008 admin-gate scanner green (existing admin_scep_intune.go
    entry covers all three handler methods; the test scanner
    enforces the triplet by file, not by endpoint, and the new
    Profiles triplet was added to the existing test file)

Backward compat preserved:
  * /api/v1/admin/scep/intune/stats unchanged — same JSON shape,
    same error codes, same M-008 gate
  * /api/v1/admin/scep/intune/reload-trust unchanged
  * /scep/intune route still works (alias to /scep with activeTab=intune)
  * IntuneStatsSnapshot Go type unchanged
  * IntuneStats(now) accessor unchanged

Refs: cowork/scep-gui-restructure-prompt.md
      cowork/scep-rfc8894-intune-master-prompt.md::Phase 9
      Phase 11.5 (SCEP probe in scanner — opt-in) and Phase 12
      (release prep + tag) of the master bundle resume after this.
2026-04-29 17:46:42 +00:00
shankar0123 77e0281a0e feat(scep-intune): GUI monitoring tab + admin endpoints
Phase 9 of the SCEP RFC 8894 + Intune master bundle. Lands the operator-
facing Intune Monitoring tab plus the two admin-gated endpoints it reads
from. Per the constitutional 'complete path' rule: counters tick on
every typed dispatcher branch, the GUI poll is live (30s for stats,
60s for the audit log filter), and the SIGHUP-equivalent reload action
is one click + a confirmation modal — no follow-up plumbing required.

Backend (Phase 9.1 + 9.2 + 9.3):

  * internal/service/scep.go gains:
    - intuneCounterTab — atomic per-status counters keyed by the same
      labels intuneFailReason() emits (success / signature_invalid /
      expired / not_yet_valid / wrong_audience / replay / rate_limited /
      claim_mismatch / compliance_failed / malformed / unknown_version).
      Lock-free on the dispatcher hot path; snapshot() returns a
      zero-allocation map for the admin endpoint.
    - dispatchIntuneChallenge wires intuneCounters.inc(...) on every
      typed return path INCLUDING the success leg (credited before
      processEnrollment so a downstream issuer-connector failure
      doesn't double-count).
    - SetPathID + PathID accessors (so admin rows surface the SCEP
      profile path ID per row).
    - IntuneStatsSnapshot + IntuneTrustAnchorInfo public types, plus
      IntuneStats(now) accessor that walks the trust holder pool and
      packages a per-profile snapshot. ReloadIntuneTrust() is the
      typed wrapper around TrustAnchorHolder.Reload that returns
      ErrSCEPProfileIntuneDisabled when called on a profile where
      Intune isn't enabled (admin endpoint maps that to HTTP 409).

  * internal/api/handler/admin_scep_intune.go:
    - AdminSCEPIntuneService narrow interface (Stats + ReloadTrust)
      so the handler depends on a small surface; AdminSCEPIntuneServiceImpl
      is the production walker over the per-profile SCEPService map.
    - AdminSCEPIntuneHandler.Stats handles GET /api/v1/admin/scep/intune/stats
      with the M-008 admin gate (non-admin → 403 + service never
      invoked); returns {profiles, profile_count, generated_at}.
    - AdminSCEPIntuneHandler.ReloadTrust handles POST
      /api/v1/admin/scep/intune/reload-trust. Body is {path_id: '<id>'};
      empty body targets the legacy /scep root profile. Returns 200 on
      success / 404 on unknown PathID / 409 when the profile is Intune-
      disabled / 500 on a parse error from intune.LoadTrustAnchor (the
      holder retains its previous pool — fail-safe). 400 on malformed
      JSON.
    - ErrAdminSCEPProfileNotFound typed error so the handler can
      distinguish 'wrong profile' from 'broken file'.

  * internal/api/router/router.go: HandlerRegistry gains
    AdminSCEPIntune; both routes registered as bearer-auth-required
    (the admin-gate is at the handler layer per the M-008 pattern).

  * cmd/server/main.go: declares scepServices map[string]*service.SCEPService
    BEFORE HandlerRegistry construction so the same map can be referenced
    from both the admin handler (constructed early) and the SCEP startup
    loop (which populates it later by reference). The per-profile loop
    now calls scepService.SetPathID(profile.PathID) and stores the service
    pointer into the shared map. AdminSCEPIntune handler is constructed
    at the same time as AdminCRLCache.

  * internal/api/handler/m008_admin_gate_test.go: AdminGatedHandlers
    map gains 'admin_scep_intune.go' with a one-line justification —
    the regression scanner enforces the per-handler test triplet
    (TestAdminSCEPIntune_NonAdmin_Returns403 + _AdminExplicitFalse_Returns403
    + _AdminPermitted_ForwardsActor) plus their POST siblings for
    ReloadTrust.

  * api/openapi.yaml: documents both endpoints with request body /
    response shape / error mapping; openapi-parity-test now matches
    the registered routes.

Frontend (Phase 9.4):

  * web/src/pages/SCEPAdminPage.tsx — single-page Intune Monitoring
    surface:
    - Per-profile cards (one card per SCEP profile). Enabled profiles
      get the full counter grid + trust-anchor-expiry badge tone
      (good ≥30d / warn 7-30d / bad <7d / EXPIRED). Disabled profiles
      get an off-state pill with the env-var hint to opt in.
    - Counters polled every 30s via TanStack Query against
      GET /admin/scep/intune/stats.
    - Recent failures table (last 50) populated from the audit log
      filtered to action=scep_pkcsreq_intune AND scep_renewalreq_intune;
      merged + sorted by timestamp descending. Polled every 60s.
    - Reload trust anchor button per profile + confirmation modal that
      explains the SIGHUP equivalence and the fail-safe behavior.
      onConfirm runs a TanStack mutation, refetches the stats query
      on success, surfaces the underlying error (eg 'trust anchor
      cert expired') in the modal on failure (modal stays open so
      operator can retry).
    - Admin gate: when authRequired && !admin the page renders an
      'Admin access required' banner and the underlying admin API
      requests are never issued (React Query enabled flag gated on
      auth.admin) — server-side enforcement is M-008.

  * web/src/api/types.ts: IntuneStatsSnapshot + IntuneTrustAnchorInfo +
    IntuneStatsResponse + IntuneReloadTrustResponse.

  * web/src/api/client.ts: getAdminSCEPIntuneStats +
    reloadAdminSCEPIntuneTrust(pathID).

  * web/src/main.tsx: new route /scep/intune. The route is unconditional;
    the gating is at the page level so deep-links land cleanly.

  * web/src/components/Layout.tsx: 'SCEP Intune' nav link between
    Observability and Audit Trail with the appropriate sidebar icon.

Tests (Phase 9.5):

  * internal/api/handler/admin_scep_intune_test.go (16 tests):
    - M-008 admin-gate triplet for both Stats (GET) and ReloadTrust
      (POST): NonAdmin / AdminExplicitFalse / AdminPermitted.
    - Method-gate tests (Stats rejects POST, ReloadTrust rejects GET).
    - Stats propagates service errors as 500.
    - ReloadTrust maps ErrAdminSCEPProfileNotFound→404,
      ErrSCEPProfileIntuneDisabled→409, generic err→500.
    - Empty body targets legacy root PathID.
    - Malformed JSON→400.
    - AdminSCEPIntuneServiceImpl handles nil map + unknown PathID.

  * web/src/pages/SCEPAdminPage.test.tsx (13 tests):
    - Admin gate (non-admin sees gated banner + zero admin API calls;
      admin sees the page; no-auth dev mode also passes).
    - Profile rendering (counters with correct labels, expiry badge
      tone for ≥30d / EXPIRED states, off-state pill for disabled
      profiles, empty-state banner when no profiles configured).
    - Reload modal (opens on click, calls mutation on Confirm,
      keeps modal open + shows error on failure, Cancel skips
      mutation).
    - Error path renders ErrorState with retry.
    - Audit log filter merges PKCSReq + RenewalReq events and sorts
      descending.

Verification:

  * gofmt clean on touched files
  * go vet ./... clean
  * staticcheck on intune/service/api/cmd-server clean
  * go test -short across api+service+intune+cmd-server: all green
  * web tsc --noEmit clean
  * Vitest: SCEPAdminPage.test.tsx 13/13 + sibling page suites all
    pass
  * G-3 docs-drift CI guard: Phase 9 adds no new CERTCTL_* env vars
    so the guard does not fire
  * openapi-parity-test green (both new admin endpoints documented)
  * M-008 regression scanner enforces the per-handler test triplet —
    pin updated, all triplets present

Refs: cowork/scep-rfc8894-intune-master-prompt.md::Phase 9
      cowork/scep-rfc8894-intune/progress.md
2026-04-29 16:14:07 +00:00