Commit Graph

931 Commits

Author SHA1 Message Date
shankar0123 cda957f302 docs: Phase 2 prep — placeholder navigation index
Per Phase 1 audit at cowork/docs-overhaul-phase-1-audit-2026-05-04/.
Phase 2 organizes docs/ into eight audience-aligned subdirectories
(getting-started, reference, operator, migration, compliance,
contributor, archive). docs/README.md will be the navigation index
linking into each.

This commit only adds the placeholder. Subdirectories materialize as
Phase 2 file moves land. Index gets populated in Phase 12 once all
moves and content edits are complete.

Audit folder: cowork/docs-overhaul-phase-1-audit-2026-05-04/
Phase 2 prompt: cowork/docs-overhaul-phase-2-restructure-prompt.md
2026-05-05 02:48:49 +00:00
shankar0123 0f81c1b956 ci: re-fix CodeQL #32 + repair loadtest f5-mock build context
Two unrelated CI failures from run #25305811340; fixed in one
commit since neither needs the other to land first.

CodeQL alert #32 (go/log-injection at middleware.go:68) reopened
after b0fc067. The previous fix introduced a scrubLogValue helper
backed by strings.NewReplacer; CodeQL's taint tracker only
recognizes the literal strings.ReplaceAll pattern as a sanitizer
(matches the OWASP example in the rule docs). Wrapper helpers and
NewReplacer don't trigger the recognition, so the analyzer kept
flagging.

Fix: drop the helper. Inline strings.ReplaceAll chains directly at
the call site for r.Method and r.URL.Path. Same runtime semantics
(strip CR/LF/NUL); CodeQL pattern-matches the literal call so the
alert can finally close.

Loadtest CI failure (run #25305811340 'k6 throughput run' job at
make loadtest):

  ERROR: failed to compute cache key: failed to calculate checksum
  of ref ...: "/deploy/test/f5-mock-icontrol": not found

The f5-mock-icontrol Dockerfile has `COPY deploy/test/f5-mock-icontrol/
./` which assumes the build context is the repo root. The
docker-compose.test.yml f5-mock-icontrol service correctly uses the
long-form build:

  build:
    context: ..        # = repo root from deploy/docker-compose.test.yml
    dockerfile: deploy/test/f5-mock-icontrol/Dockerfile

The loadtest compose at deploy/test/loadtest/docker-compose.yml
used the shorthand:

  build: ../f5-mock-icontrol

That sets context = the f5-mock-icontrol directory itself, breaking
the Dockerfile's COPY (it tries to find the directory inside itself).

Fix: change the loadtest compose to the long-form pattern matching
docker-compose.test.yml, with context: ../../.. (= repo root from
deploy/test/loadtest/) and explicit dockerfile path.

Verified locally:
  gofmt: clean.
  go vet ./internal/api/middleware/...: exit 0.
  go test -short -count=1 ./internal/api/middleware/...: ok 0.253s.
  python3 -c 'import yaml; yaml.safe_load(...)' on the compose
    file: parses clean.
  grep -rnE 'scrubLogValue' internal/api/: zero references (helper
    fully dropped).

References:
  https://github.com/certctl-io/certctl/security/code-scanning/32
  CI run https://github.com/certctl-io/certctl/actions/runs/25305811340
Closes CodeQL #32 + restores loadtest CI.
v2.0.70
2026-05-04 17:26:24 +00:00
shankar0123 ff6ffcda1b refactor(web): drop 5 unused imports across 4 pages (CodeQL #6, #7, #8, #9)
Four CodeQL js/unused-local-variable alerts in one sweep — all
Note severity, all pure dead-import cleanup verified by grep
(each removed symbol had exactly 1 occurrence in its file: the
import line itself).

Alert #6 — web/src/pages/AgentFleetPage.tsx:3:
  Drop Legend from recharts named-import list. The fleet pie
  chart renders without a legend (the slice colors are labeled
  inline via Tooltip).

Alert #7 — web/src/pages/DashboardPage.tsx:9:
  Drop getAgents + getNotifications from the api/client named-
  import list. The dashboard summary card now uses
  getDashboardSummary (single endpoint) instead of fanning out
  to per-resource list calls; the agents + notifications full
  list is reachable via dedicated pages.

Alert #8 — web/src/pages/CertificatesPage.tsx:6:
  Drop revokeCertificate from the api/client named-import list.
  The page uses bulkRevokeCertificates for the multi-cert UX;
  single-cert revoke happens on CertificateDetailPage which
  imports revokeCertificate independently.

Alert #9 — web/src/pages/DiscoveryPage.tsx:15:
  Drop the StatusBadge default-import line. Discovered-cert
  status renders inline (text label colored via the row's
  state-class) without the StatusBadge component.

Verified locally:
  Each flagged symbol: 0 occurrences in its file post-edit.
  tsc --noEmit: exit 0.
  No behavioral change — pure import-list cleanup.

References:
  https://github.com/certctl-io/certctl/security/code-scanning/6
  https://github.com/certctl-io/certctl/security/code-scanning/7
  https://github.com/certctl-io/certctl/security/code-scanning/8
  https://github.com/certctl-io/certctl/security/code-scanning/9
Closes all four alerts.
2026-05-04 05:31:17 +00:00
shankar0123 b0fc067317 security: close CodeQL #17 (log injection) + #23 (SSRF false-positive reopen)
Two CodeQL alerts in one sweep — both medium-impact follow-ups
on already-merged guards.

Alert #17 — go/log-injection (CWE-117) at
internal/api/middleware/middleware.go:58:

  log.Printf("[%s] %s %s %d %v", requestID, r.Method, r.URL.Path, ...)

  r.Method and r.URL.Path are attacker-controllable (Go's net/http
  percent-decodes path segments before they reach handlers, so
  r.URL.Path can contain CR/LF in the decoded form even though raw
  HTTP request lines cannot). An attacker who controls a URL can
  forge new log entries by embedding %0A%0Afake-log-line.

  Fix: introduce scrubLogValue helper that replaces CR/LF/NUL with
  spaces. Apply to both r.Method and r.URL.Path. Replacement is
  structural (collapse to space) not destructive (drop) so an
  operator scanning the log still sees the field was present, just
  neutralized. Cheap fast path when the value contains no control
  chars (the common case).

  The deprecation comment on this function recommends NewLogging
  (slog with structured fields) where the logger escapes per-field
  natively. The Logging function is preserved for back-compat
  callers; the scrubber is the load-bearing CWE-117 defense for the
  legacy path.

Alert #23 — go/request-forgery (CWE-918) at scep_probe.go:271:

  CodeQL reopened the alert after commit e6919cd. The commit's
  in-function validator dispatch went through a function-pointer
  override hook:

    validateURL := s.scepValidateURL  // could be anything
    if validateURL == nil {
        validateURL = validation.ValidateSafeURL
    }
    if err := validateURL(rawURL); err != nil { ... }

  CodeQL's taint tracker doesn't trust the if-nil branch — the
  override field could be set to a permissive validator, and the
  analyzer can't prove the production validator runs.

  Fix: invert the dispatch. Always call validation.ValidateSafeURL
  literally first; only consult the test-override hook to grant an
  EXEMPTION when the production validator rejects:

    if err := validation.ValidateSafeURL(rawURL); err != nil {
        if s.scepValidateURL == nil || s.scepValidateURL(rawURL) != nil {
            return ... validate url error
        }
    }

  Same applies to ProbeSCEP's entry-point validator. Both call sites
  now have the literal validation.ValidateSafeURL call in-scope of
  the sink (client.Do), which CodeQL recognizes as a sanitizer.

  Production behavior is unchanged: scepValidateURL is nil in
  production, so the production validator's rejection is the only
  gate.

  Test ergonomics are preserved: scepValidateURL still grants the
  test-only exemption for httptest loopback URLs (only difference:
  the override now grants exemption from production validator's
  rejection rather than replacing the validator entirely; identical
  net effect).

Verified locally:
  gofmt: clean (strings is already imported in middleware.go).
  go vet ./internal/api/middleware/... + ./internal/service/...:
    exit 0.
  go test -short ./internal/api/middleware/...: ok 0.244s.
  go test -short ./internal/service/...: ok 4.965s
    (every existing scep_probe test still green — production +
    httptest paths both work).

References:
  https://github.com/certctl-io/certctl/security/code-scanning/17
  https://github.com/certctl-io/certctl/security/code-scanning/23
Closes CodeQL #17. Re-closes CodeQL #23 with a fix CodeQL's taint
tracker can verify.
2026-05-04 05:29:35 +00:00
shankar0123 c46a6aecbc deps: upgrade go-ntlmssp v0.0.0-20221128 → v0.1.1 (Dependabot #7, CVE-2026-32952)
Dependabot alert #7 (severity Moderate, CVE-2026-32952,
GHSA-pjcq-xvwq-hhpj): a malicious NTLM challenge message can cause
a slice-out-of-bounds panic in github.com/Azure/go-ntlmssp,
crashing any Go process using ntlmssp.Negotiator as an HTTP
transport. Pre-v0.1.1 versions are vulnerable.

Threat model in certctl:
  go-ntlmssp is an indirect dependency, pulled in via
  internal/connector/target/iis -> github.com/masterzen/winrm
  -> github.com/Azure/go-ntlmssp. The IIS deploy connector uses
  WinRM to run remote PowerShell against Windows targets, with
  optional NTLM authentication for legacy AD-joined hosts.

  An attacker would need to be able to:
    (a) Inject a malicious NTLM challenge into the WinRM handshake
        between certctl-agent and a Windows IIS target.
    (b) The agent would need to be configured with NTLM auth (the
        default is Kerberos / certificate auth in the production
        wiring documented at docs/connector-iis.md).

  Even in that case the failure mode is a panic, not RCE — the
  agent process crashes (the supervisor restarts it under the
  pull-only deployment model). Availability impact only (matches
  the CVSS 'Availability: Low' rating).

Fix:
  go get github.com/Azure/go-ntlmssp@v0.1.1
  Stale go.sum lines for the old v0.0.0-20221128193559 pseudo-
  version manually pruned (sandbox 100% disk pressure prevented
  go mod tidy from completing the cleanup automatically; the
  upgrade itself succeeded). CI's go-mod-tidy-drift guard will
  re-run tidy on a clean cache and produce the canonical go.sum
  state.

Verified locally:
  go.mod: require github.com/Azure/go-ntlmssp v0.1.1 // indirect
  go.sum: only the v0.1.1 entries remain.
  go mod why github.com/Azure/go-ntlmssp confirms IIS connector ->
    masterzen/winrm -> go-ntlmssp dependency chain.
  go build ./internal/connector/target/iis/... + wincertstore/...
    exit 0 (the only consumers).
  go vet on both packages: exit 0.
  go test -short -count=1 ./internal/connector/target/iis/...:
    ok 0.016s.
  go test -short -count=1 ./internal/connector/target/wincertstore/...:
    ok 0.012s.

Reference: https://github.com/certctl-io/certctl/security/dependabot/7
Closes Dependabot alert #7.
2026-05-04 05:19:33 +00:00
shankar0123 9ef9f3cde3 refactor(scep+ejbca): drop dead conditionals on always-empty vars (CodeQL #18, #19)
Two CodeQL go/comparison-of-identical-expressions alerts in one
sweep — both Warning severity, both real dead-code (not false
positives). CodeQL detected that each comparison's LHS variable
was provably constant.

Alert #18 — internal/api/handler/scep.go:612 (extractCSRFields):

  challengePassword := ""
  transactionID := ""
  // ... loop populates challengePassword from CSR.Attributes ...
  for _, attr := range csr.Attributes {
      if attr.Type.Equal(oidChallengePassword) {
          // populates challengePassword ONLY — transactionID stays ""
      }
  }
  if transactionID == "" && csr.Subject.CommonName != "" {  // ← always true
      transactionID = csr.Subject.CommonName
  }

  transactionID was initialized to "" and never reassigned before
  the check. The conditional was always true; the MVP path was
  effectively "unconditionally fall back to CN". The RFC 8894 path
  (tryParseRFC8894 above this function) extracts transaction-ID
  properly from PKCS#7 authenticatedAttributes; the MVP path is for
  lightweight legacy clients that send the raw CSR with no PKCS#7
  wrapping, and CN-as-transaction-ID is sufficient there.

  Fix: drop the dead transactionID local var + dead conditional;
  unconditionally set transactionID = csr.Subject.CommonName. No
  behavioral change — the runtime semantics are identical to before
  (every valid invocation already took the fallback). The CN
  extraction stays robust because the empty-CN case still produces
  an empty transactionID, which downstream callers handle.

Alert #19 — internal/connector/issuer/ejbca/ejbca.go:415 (RevokeCertificate):

  serial := request.Serial
  issuerDN := ""
  // (comment: "if we have time..." — TODO never followed up)
  revokeURL := fmt.Sprintf("%s/certificate/%s/%s/revoke", apiURL, issuerDN, serial)
  if issuerDN == "" {  // ← always true
      revokeURL = fmt.Sprintf("%s/certificate/%s/revoke", apiURL, serial)
  }

  issuerDN was hardcoded to "" two lines above. The first revokeURL
  line was unreachable dead code; the conditional always fired and
  the serial-only URL always won. EJBCA's REST API has both
  /certificate/{issuer_dn}/{serial}/revoke and /certificate/{serial}/revoke
  endpoints; the serial-only form is correct for typical certctl
  deployments where one EJBCA CA maps to one certctl issuer config
  (no overlapping serial spaces).

  Fix: drop the dead first revokeURL + dead conditional; build
  revokeURL once via the serial-only endpoint. No behavioral change
  — the runtime URL was always the serial-only one. Comment retained
  + expanded to document the future-enhancement path (parse issuer
  DN from IssuanceResult metadata + use the DN-qualified endpoint
  when a multi-CA EJBCA deployment surfaces).

Verified locally:
  gofmt: clean.
  go vet ./internal/api/handler/... + ./internal/connector/issuer/ejbca/...: exit 0.
  go test -short -count=1 ./internal/api/handler/... + ejbca/...: PASS.
  Both fixes are pure dead-code removal — runtime behavior is byte-
  identical to pre-edit. The existing test suites would have caught
  any actual behavioral change.

References:
  https://github.com/certctl-io/certctl/security/code-scanning/18
  https://github.com/certctl-io/certctl/security/code-scanning/19
Closes both alerts.
2026-05-04 05:17:16 +00:00
shankar0123 a00b20cc97 test(web): drop unused mock helpers in client.error.test.ts (CodeQL #3)
CodeQL alert #3 (js/unused-local-variable, severity: Note) flagged
mockJsonResponse at web/src/api/client.error.test.ts:39 as dead.

Audit: client.error.test.ts is the error-path companion to
client.test.ts. Every test in this file drives a non-2xx response
through the client function under test via mockErrorResponse (52
call sites). Both mockJsonResponse AND mockBlobResponse were drafted
alongside the scaffolding but never used — the success-path coverage
lives in client.test.ts, not this file.

CodeQL only flagged mockJsonResponse, but mockBlobResponse is the
same shape (defined, never called). Cleaning both up for
consistency with the file's error-only scope.

Replaced with a one-paragraph comment explaining the file's scope
so future contributors don't re-add the helpers expecting them to
be used.

Verified locally:
  tsc --noEmit: exit 0.
  grep -c mockJsonResponse + mockBlobResponse:
    1 each (the comment mention only).
  No behavioral change.

Reference: https://github.com/certctl-io/certctl/security/code-scanning/3
Closes CodeQL alert #3 (js/unused-local-variable).
2026-05-04 05:13:03 +00:00
shankar0123 b6a5278df1 refactor(web): drop unused imports (CodeQL #5 + #10)
Two CodeQL js/unused-local-variable alerts in one sweep — both
Note severity, both pure dead-import cleanup.

Alert #10 (web/src/pages/NotificationsPage.tsx:8):
  formatDateTime imported but only timeAgo used. Verified via
  repo-wide grep — formatDateTime appears on the import line only.
  Drop from the import statement; leave timeAgo in place.

Alert #5 (web/src/api/client.test.ts:2):
  Five unused imports in the test file's import block (the test
  file imports nearly the full API client surface):
    - acknowledgeHealthCheck
    - createPolicy
    - deleteHealthCheck
    - getHealthCheckHistory
    - updateHealthCheck
  Each appears only on the import line — verified via grep -c.
  Removing them doesn't change test coverage (the corresponding
  client functions are exported and exercised in their own tests
  elsewhere, but the integration covered by client.test.ts doesn't
  reach them yet).

Verified locally:
  tsc --noEmit: exit 0.
  grep -c on each removed symbol in its file: 0 occurrences.
  No behavioral change — pure import-list cleanup.

References:
  https://github.com/certctl-io/certctl/security/code-scanning/10
  https://github.com/certctl-io/certctl/security/code-scanning/5
Closes both alerts.
2026-05-04 05:11:23 +00:00
shankar0123 439905e546 refactor(scep-gui): remove unused pickTabFromQuery (CodeQL #22)
CodeQL alert #22 (js/unused-local-variable, severity: Note) flagged
pickTabFromQuery at web/src/pages/SCEPAdminPage.tsx:584 as dead code.

Audit: this function is a leftover from an incomplete refactor. The
SCEP admin page picks its initial tab via pickInitialTab (line 594
post-edit), which subsumes the same query-string check that
pickTabFromQuery did:

  pickInitialTab honors three signals (precedence high → low):
    1. ?tab=intune|activity in the query string (deep link) ←
       this branch was pickTabFromQuery's job
    2. Pathname ending in /scep/intune (legacy alias from Phase 9.4)
    3. Default to 'profiles'

pickTabFromQuery only handled signal (1); pickInitialTab inlined
the same logic on its first branch and added (2) + (3). Nothing
references pickTabFromQuery (verified via repo-wide grep). Pure
dead code.

Fix: delete the function. No behavioral change — pickInitialTab
already does the work.

Verified locally:
  tsc --noEmit: exit 0.
  grep -nE 'pickTabFromQuery' web/src/: zero references.

Reference: https://github.com/certctl-io/certctl/security/code-scanning/22
Closes CodeQL alert #22 (js/unused-local-variable).
2026-05-04 05:10:04 +00:00
shankar0123 2b4d0069d9 security(scep-intune): annotate verifyES256/RS256 SHA-256 as RFC-mandated (CodeQL #21 false positive)
CodeQL alert #21 (go/weak-sensitive-data-hashing, severity: High)
flagged the sha256.Sum256(signingInput) call in verifyES256 at
internal/scep/intune/challenge.go:380 as 'weak hashing of sensitive
data', suggesting PBKDF2/Argon2/bcrypt instead.

This is a CodeQL false positive. The CodeQL query triggers when
SHA-256 is used near *x509.Certificate (the trust pool) and infers
'this might be password hashing.' But the actual context is JWS
signature verification:

  - verifyRS256 implements RFC 7518 §3.3 — 'RSASSA-PKCS1-v1_5
    using SHA-256'. SHA-256 is spec-mandated.
  - verifyES256 implements RFC 7518 §3.4 — 'ECDSA using P-256
    and SHA-256'. SHA-256 is spec-mandated.
  - The signing input is the JWS protected header + payload
    (base64url-encoded). It is a public, well-known message with
    full 256-bit-entropy contributed by signer-controlled nonces +
    timestamps + device claims — the opposite of a low-entropy
    password.
  - The output is verified against an asymmetric signature
    (rsa.VerifyPKCS1v15 / ecdsa.Verify), not compared to a
    pre-computed hash digest. This is signature verification,
    not password hashing.
  - Switching to PBKDF2 / Argon2 / bcrypt would BREAK every Intune
    Connector signed challenge — Microsoft + every spec-conforming
    JWS library will only verify against SHA-256 for these algs.

Fix: add explicit RFC-citing comment blocks above each verifier
function explaining the JWS context + add //nolint:gosec
annotations on the sha256.Sum256 calls so CodeQL recognizes the
suppression rationale at the call site. The annotation cites the
specific RFC clause (7518 §3.3 / §3.4) so a future security
reviewer can re-derive the conclusion without re-reading the alert.

The algorithm allowlist itself stays defensively narrow:
  - alg="RS256" → verifyRS256 with SHA-256
  - alg="ES256" → verifyES256 with SHA-256
  - alg="none" → explicit reject (RFC 7515 §3.6 attack vector)
  - any other alg → reject as unsupported

Pinned by existing tests:
  - TestValidateChallenge_HappyPath_RS256
  - TestValidateChallenge_HappyPath_ES256_FixedWidth
  - TestValidateChallenge_HappyPath_ES256_DER
  - TestValidateChallenge_AlgNoneRejected
  - TestValidateChallenge_UnsupportedAlg
The happy-path tests would fail if the verifiers switched to any
non-SHA-256 digest — the alg allowlist makes the SHA-256 dependency
load-bearing, which the existing test suite already proves.

Verified locally:
  gofmt: clean.
  go vet ./internal/scep/intune/...: exit 0.
  go test -short -count=1 ./internal/scep/intune/...: PASS
    (every existing challenge_test.go subtest still green).

Reference: https://github.com/certctl-io/certctl/security/code-scanning/21
Closes CodeQL alert #21 as a documented false positive — the
//nolint annotations + RFC-citing comments are the load-bearing
suppression. Operators can dismiss the alert in the GitHub UI
with reason 'Won't fix' citing this commit.
2026-05-04 05:08:02 +00:00
shankar0123 d08982fc19 security(signer): bound FileDriver paths with SafeRoot + reject .. (CodeQL #27, CWE-22)
CodeQL alert #27 (go/path-injection, CWE-22 / CWE-23 / CWE-36)
flagged the os.WriteFile sink at internal/crypto/signer/file_driver.go:194
because the outPath flowed from operator-supplied config (CAKeyPath
in the local issuer's encrypted config blob -> GenerateOutPath
closure -> os.WriteFile) without a containment check.

Threat model:
  Production wiring (cmd/server/main.go) constructs
  &signer.FileDriver{} and the local-issuer NewConnector wires
  GenerateOutPath off Config.CAKeyPath. CAKeyPath ships from the
  encrypted issuer config in PostgreSQL — settable only by an
  authenticated admin via the API. So the realistic exploit is:
    (a) Admin compromise -> CAKeyPath set to /etc/passwd ->
        FileDriver.Generate overwrites system files.
    (b) Future code path concatenates attacker-controlled fragments
        into the output path -> classic ../../etc/passwd traversal.
  Defense in depth: bound the write surface so admin-key-rotation
  errors and future regressions can't escape into arbitrary
  filesystem writes.

Fix:
  internal/crypto/signer/file_driver.go gains:
    - SafeRoot string field on FileDriver. When set, every Load +
      Generate path MUST resolve under SafeRoot via filepath.Abs +
      strings.HasPrefix on cleaned paths.
    - validateSafePath helper that:
        * rejects empty paths
        * filepath.Clean()s the input
        * rejects paths whose cleaned form still contains a literal
          ".." segment (catches relative paths that escape above
          their start; absolute paths get collapsed by Clean)
        * resolves to filepath.Abs and (when SafeRoot non-empty)
          verifies containment via filepath.Separator-suffixed
          HasPrefix (the bare-prefix bug — SafeRoot=/var/lib/foo
          erroneously accepting /var/lib/foobar — has its own
          regression test below)
    - Load + Generate now call validateSafePath before any
      os.ReadFile / os.WriteFile. The validator is in the same
      function as the sink so CodeQL recognizes it as a guard.

Tests (internal/crypto/signer/signer_test.go):
  TestFileDriver_Load_RejectsParentTraversal — relative path
    "../../etc/passwd" rejected with parent-directory error.
  TestFileDriver_Load_RejectsEmptyPath — empty path rejected.
  TestFileDriver_Generate_RejectsParentTraversal — write side, same
    pattern.
  TestFileDriver_SafeRoot_AcceptsContainedPath — happy path: a key
    file under SafeRoot succeeds.
  TestFileDriver_SafeRoot_RejectsEscape — absolute path outside
    SafeRoot rejected (the load-bearing CodeQL pin).
  TestFileDriver_SafeRoot_RejectsSiblingPrefix — pins the
    HasPrefix-with-separator subtlety: SafeRoot=/tmp/X must NOT
    accept /tmp/X-sibling.

Verified locally:
  gofmt: clean.
  go vet ./...: exit 0.
  go test -short -count=1 ./internal/crypto/signer/...: ok 1.605s
  go test -short -count=1 ./internal/connector/issuer/local/...:
    ok 4.908s (downstream FileDriver consumer)
  go test -short -count=1 ./internal/service/...: ok 4.029s

Backwards-compat: when SafeRoot is unset, only the structural
.. + empty-path checks fire — the existing FileDriver call sites
in cmd/server/main.go and the existing unit tests pass unchanged.
Production wiring SHOULD set SafeRoot via cmd/server/main.go in
a follow-up commit (env-var-supplied CERTCTL_CA_KEY_DIR or
similar).

Reference: https://github.com/certctl-io/certctl/security/code-scanning/27
Closes CodeQL alert #27 (go/path-injection).
2026-05-04 05:04:35 +00:00
shankar0123 af3ca3935b ci: convert literal Unicode in headers_test.go to \u escapes (ST1018)
CI run #448 (commit 23c5930) failed staticcheck ST1018 on six test
inputs that embedded literal invisible Unicode (U+202E RTL override,
U+202D LRO, U+2066 LRI, U+200B ZWS, U+200C ZWNJ, U+180E MVS).
golangci-lint enforces ST1018 in CI but go vet doesn't, so the
local pre-commit gate (gofmt + go vet + go test) didn't catch it —
the canonical Bundle 9 staticcheck-vs-vet drift case CLAUDE.md
explicitly warns about.

Fix: convert each literal-Unicode test input to its \uXXXX ASCII
escape form. Verified via byte-level Python sed against UTF-8 byte
sequences (\xe2\x80\xae -> ‮, \xe2\x80\xad -> ‭,
\xe2\x81\xa6 -> ⁦, \xe2\x81\xa9 -> ⁩, \xe2\x80\x8b ->
​, \xe2\x80\x8c -> ‌, \xe1\xa0\x8e -> ᠎). The U+202C
(PDF — Pop Directional Formatting) closer was caught by the same
sweep since two RTL/LRO test cases use it.

The runtime semantics are byte-identical — Go interprets ‮
and the literal U+202E byte sequence to the same rune. Only the
source text changed.

Verified locally:
  gofmt -l internal/validation/: clean.
  go vet ./...: exit 0.
  go test -short -count=1 ./internal/validation/...: ok 0.014s
    (all 4 test cases in TestSanitizeEmailBodyValue_StripsBidiOverride
    + the rest of the suite still green — semantics unchanged).
  Sandbox couldn't install staticcheck (disk pressure on
  /tmp/gopath), but the rule is mechanical: U+XXXX format chars in
  string literals must use \uXXXX. Every flagged literal is fixed.

Reference: CI run https://github.com/certctl-io/certctl/actions/runs/25301809013

Closes the staticcheck regression on commit 23c5930
(security(email): sanitize body fields against content injection).
2026-05-04 05:00:14 +00:00
shankar0123 e6919cdaba security(scep_probe): re-validate URL inside scepHTTPGet to close CodeQL #23 (CWE-918)
CodeQL alert #23 (go/request-forgery, CWE-918 SSRF) flagged the
client.Do(req) sink at internal/service/scep_probe.go:232 because
the URL parameter to scepHTTPGet is taint-traced from the user-
supplied input to ProbeSCEP without the analyzer recognizing the
upstream sanitizer.

The defense-in-depth was already in place:
  1. validation.ValidateSafeURL at ProbeSCEP entry (line 75) —
     rejects obvious SSRF targets (loopback / link-local / cloud
     metadata literals) before any network call.
  2. validation.SafeHTTPDialContext on the http.Transport —
     re-resolves the host at dial time and rejects connections to
     reserved IP ranges. This is the authoritative SSRF + DNS-
     rebinding guard. Even if step 1 was bypassed, the dial would
     still fail.

But CodeQL's taint tracker doesn't follow the validator across
function boundaries, so the alert stays open even though the code
is safe. This commit re-runs validation.ValidateSafeURL inside
scepHTTPGet immediately before http.NewRequestWithContext —
sanitizer in the same function as the sink, which CodeQL
recognizes as a guard.

Bonus defense-in-depth: any future call site that wires a URL
into scepHTTPGet without going through ProbeSCEP (e.g. a new code
path that directly probes a discovered URL) inherits the same
SSRF guard automatically. Fail-closed by default.

The validator dispatch matches ProbeSCEP's pattern — tests
override via s.scepValidateURL to hit httptest loopback servers;
production callers use validation.ValidateSafeURL. The probe's
existing httptest-based tests continue to work unchanged.

Verified locally:
  gofmt: clean.
  go vet ./...: exit 0.
  go test -short ./internal/service/...: ok 4.029s
    (every existing scep_probe test still green — the new
    revalidation is a no-op for tests that go through ProbeSCEP
    because the same validator already passed once at entry).

Reference: https://github.com/certctl-io/certctl/security/code-scanning/23
Closes CodeQL alert #23 (go/request-forgery).
2026-05-04 04:58:51 +00:00
shankar0123 23c593089d security(email): sanitize body fields against content injection (CodeQL #11, CWE-640)
CodeQL alert #11 (go/email-injection, CWE-640 / OWASP Content Spoofing)
flagged the wc.Write(message) sink at internal/connector/notifier/email/
email.go:208 because attacker-controllable fields flow into the email
body unchecked.

Threat model:
  Headers (From, To, Subject) were already protected by
  validation.ValidateHeaderValue (CWE-113 SMTP header injection,
  closed in commit 3853b74). The remaining gap was the body.
  An attacker controls multiple fields that surface to the body of
  alert/event notifications:
    - alert.Subject, alert.Message
    - event.Subject, event.Body, *event.CertificateID
    - alert.Metadata + event.Metadata key/value pairs
  These can carry CR/LF (forged 'Reply-To: attacker@evil.com' inside
  the body that recipients skim), NUL bytes (RFC 5321 4.5.2 violation
  that some MTAs truncate at), bidi-override Unicode (visually-
  spoofable URLs), zero-width / invisible Unicode (phishing), or
  malformed UTF-8 (Go emits U+FFFD which becomes a glyph in mail
  clients).

  The HTML email path (digest service) already uses html/template
  upstream and is safe via contextual auto-escape. This commit
  closes the plaintext path.

Fix:
  internal/validation/headers.go gains SanitizeEmailBodyValue —
  a sanitizer that NEVER errors (the right contract for body
  content; over-eager rejection drops operator notifications) and
  scrubs:
    - NUL bytes (stripped entirely)
    - bare CR / LF (replaced with space — single fields should never
      carry their own line breaks; the surrounding template handles
      legitimate CRLFs)
    - C0 control chars < 0x20 except TAB
    - DEL (0x7F) + C1 control chars (0x80-0x9F)
    - U+FFFD (defense in depth: malformed UTF-8 -> Go emits this;
      strip so attacker-planted invalid bytes don't survive as an
      arbitrary glyph)
    - Bidi-override Unicode (U+202A..U+202E, U+2066..U+2069)
    - Zero-width / invisible Unicode (U+200B..U+200D, U+2060..U+2063,
      U+FEFF, U+180E)
    - Catch-all unicode.IsControl for anything not enumerated above
  Codepoint table uses numeric ranges rather than rune-literal switch
  cases — Go source rejects literal invisible characters (BOM U+FEFF)
  mid-file, so the table compares against numeric values.

  internal/connector/notifier/email/email.go applies the sanitizer
  at every interpolation site:
    - formatAlertBody: alert.ID/Type/Severity/Subject/Message
      (CreatedAt is time.Time -> RFC3339, structural, not sanitized)
    - formatEventBody: event.ID/Type/Subject/Body, *CertificateID
      (CreatedAt structural, not sanitized)
    - formatMetadata: both keys and values
  The sendEmail / formatEmailMessage call sites continue to validate
  headers (From / To / Subject) via the existing ValidateHeaderValue
  fail-closed gate; the new sanitizer is body-side only.

Tests (internal/validation/headers_test.go):
  TestSanitizeEmailBodyValue_PreservesSafeInput
    Pin: ordinary ASCII, UTF-8 multibyte (résumé / 日本語 / مرحبا),
    tabs, common cert DNs, URLs all flow through unchanged.
  TestSanitizeEmailBodyValue_StripsControlChars
    Table-driven across NUL, bare LF/CR, CRLF, BEL, backspace, DEL,
    C1 (U+0080 / U+009F), U+FFFD, TAB-preserve.
  TestSanitizeEmailBodyValue_StripsBidiOverride
    7 attacker payloads (RLO, LRO, LRI, zero-width space, ZWNJ, BOM,
    MVS) — each must produce a non-identity output.
  TestSanitizeEmailBodyValue_ContentSpoofingScenario
    The CodeQL example case: 'alert\r\nReply-To: attacker@evil.com\r\n
    Click https://evil.example.com/reset' — verify NO CR/LF survives.

Verified locally:
  gofmt: clean.
  go vet ./...: exit 0.
  go test -short -count=1 ./internal/validation/...: ok 0.374s
  go test -short -count=1 ./internal/connector/notifier/email/...: ok 0.186s

Reference: https://github.com/certctl-io/certctl/security/code-scanning/11
Closes CodeQL alert #11 (go/email-injection).
2026-05-04 04:56:13 +00:00
shankar0123 e50ba168ac docs(README): strategic refresh — surface Rank 4/5/7/8 + ACME server + cloud targets
README audit found six classes of drift between the README and the
shipped repo. Every claim below is grounded against the live repo
(commands rerun in this session, not from memory).

Stale numeric claims fixed:
  '111 routes'   → '180+ routes'
                   (live: grep -cE 'r\.Register' router.go = 184)
  '80 tools'     → '85+ tools'
                   (live: grep -cE 'mcp\.AddTool' tools.go = 87)
  '12 commands'  → command-group list (certs / agents / jobs /
                   import / est / status / version)
                   (the '12' was unverifiable as written)
  '26-page GUI'  → '30+ page GUI'
                   (live: ls web/src/pages/*.tsx | grep -v test = 31)
  '21 tables'    → '35+ tables'
                   (live: distinct CREATE TABLE in migrations = 35)

Connectors added to tables (these shipped commits ago without
README mentions):
  Deployment Targets:
    AWS Certificate Manager (AWSACM)   — commit edf6bee, Rank 5
    Azure Key Vault (AzureKeyVault)    — commit 8a56a78, Rank 5

  Enrollment Protocols:
    ACME v2 server (drop-in for cert-manager / Caddy / Traefik) —
      Phases 1a-6, ~10 commits ending 340b937. Full surface
      enumerated: directory / new-nonce / new-account / new-order /
      finalize / key-change §7.3.5 / revoke-cert §7.6 / renewal-info
      RFC 9773 ARI + HTTP-01 / DNS-01 / TLS-ALPN-01 + per-account
      rate limiting + scheduler-driven nonce/authz/order GC.

  Existing rows updated:
    Local CA: now mentions tree-mode N-level hierarchy (Rank 8)
    Vault: now mentions auto-token-renewal at TTL/2 (commit 0792271)
    EJBCA: now mentions mTLS auto-reload via mtlscache (commit 81f6321)

Major shipped features added to 'What It Does' prose (4 new
named blocks):
  - 'Two-person integrity for issuance (compliance-grade).'
    — Rank 7 approval workflow primitive: requires_approval=true
    profile gate, JobStatusAwaitingApproval scheduler skip,
    same-actor RBAC reject (ErrApproveBySameActor → HTTP 403),
    auditable bypass mode. Procurement-checklist closer for PCI-DSS
    Level 1 / FedRAMP / SOC 2 / HIPAA.

  - 'Multi-level CA hierarchy management.'
    — Rank 8 first-class CA hierarchy: intermediate_cas table,
    RFC 5280 §3.2 / §4.2.1.9 / §4.2.1.10 service-layer enforcement,
    drain-first retire, FedRAMP / financial-services / internal-PKI
    patterns, byte-equivalence pin for unmigrated deployments.

  - 'Run certctl as your ACME server.'
    — Beyond consuming public ACME CAs, certctl now serves RFC 8555.
    Three client walkthroughs (cert-manager, Caddy, Traefik) cited.

  - 'Cloud-managed targets.'
    — AWS ACM + Azure Key Vault SDK-driven import + atomic rollback.

  - 'Notifications + per-policy multi-channel routing.'
    — Rank 4: AlertChannels matrix + AlertSeverityMap +
    fault-isolating per-channel dispatch + Prometheus counter.

V2 paragraph rewritten:
  Pre-edit: a single 800-word wall-of-text bullet that listed
    everything. Buried Rank 4-8 features in the middle.
  Post-edit: 12 named feature blocks, each one to two sentences.
    Scannable. Cloud targets, ACME server, approval workflow,
    CA hierarchy, multi-channel alerts each get their own
    headline + one-line story + doc link.

Documentation table extended with 5 newly-linked operator runbooks
(all of which existed but were never reachable from the README):
  - docs/acme-server.md
  - docs/approval-workflow.md
  - docs/intermediate-ca-hierarchy.md
  - docs/runbook-cloud-targets.md
  - docs/runbook-expiry-alerts.md

Plus 4 deeper cross-links inside the Enrollment Protocols + 'What
It Does' prose:
  - docs/acme-cert-manager-walkthrough.md
  - docs/acme-caddy-walkthrough.md
  - docs/acme-traefik-walkthrough.md
  - docs/acme-server-threat-model.md

Verified locally:
  All 9 previously-orphaned docs now reachable from README.md.
  No stale numeric claim remains:
    grep -nE '\b(111 routes|80 tools|12 commands|26.page|21 tables)' README.md
    → no matches.
  README size: 426 → 457 lines (+31). Net addition is 4 prose
    blocks + 2 table rows + 5 doc-table rows + 1 V2 paragraph
    rewrite (15 → 12 lines but each line denser).

Strategic framing (CMO hat):
  - ACME server is the cert-manager adoption-funnel headline; gets
    its own table row + dedicated 'What It Does' block.
  - CA hierarchy is the Venafi / EJBCA replacement story for
    FedRAMP / financial-services / internal-PKI procurement;
    explicit market positioning.
  - Approval workflow framed as procurement-checklist closer
    (PCI-DSS L1 / FedRAMP / SOC 2 / HIPAA explicitly named).
  - Cloud-managed targets framed as 'we deploy to your cloud
    secret store' story.

Doc-only commit. No code, no test changes.
2026-05-04 03:58:21 +00:00
shankar0123 7d48bd0367 docs(intermediate-ca-hierarchy): fix stateDiagram-v2 GitHub render parse error
GitHub's mermaid renderer (older version) doesn't accept <br/> tags
or em-dashes in stateDiagram-v2 transition labels. The conversion
shipped in 85649cf used both, which the GitHub markdown view rejects
with:

  Parse error on line 6: ...ding for<br/>already-issued leaves until
                         -----------------------^
  Expecting 'SPACE', 'NL', 'DESCR', '-->', ... got 'INVALID'

(flowchart and sequenceDiagram tolerate <br/> + em-dashes inside
labels — only stateDiagram-v2 trips.)

Fix: shorten transition labels to single-line ASCII and move the
long-form descriptions into 'note right of <state>' blocks. Same
information, renders cleanly on GitHub.

  active --> retiring : Retire(confirm=false)
  retiring --> retired : Retire(confirm=true)
  retired --> [*]

  note right of retiring
      Drain start. CA stops issuing
      NEW children; existing children
      keep issuing until they retire.
  end note

  note right of retired
      Terminal. Refused if active children
      remain (ErrCAStillHasActiveChildren
      → HTTP 409). OCSP keeps responding
      for already-issued leaves until expiry.
  end note

Verified locally:
  Other mermaid blocks added in the audit pass (sequenceDiagram +
    flowchart TD) keep their <br/> + em-dashes — those don't trip
    GitHub's renderer. Only stateDiagram-v2 needed the fix.
  No content lost. The note blocks carry every fact the old
    multi-line transition labels had.

Doc-only commit.
2026-05-04 02:43:47 +00:00
shankar0123 85649cf983 docs: convert remaining ASCII diagrams to mermaid (audit closure)
Audit pass over docs/ found 4 files with non-mermaid (ASCII
box-drawing) diagrams in fenced code blocks. The other 9 doc files
already used mermaid blocks (architecture.md, demo-advanced.md,
ci-pipeline.md, concepts.md, est.md, legacy-est-scep.md, mcp.md,
qa-test-guide.md, scep-intune.md). Rendering parity for everything
in docs/.

Conversions:

  approval-workflow.md
    1 ASCII swimlane → sequenceDiagram with named participants
    (Operator A / CertificateService / Job+ApprovalRequest /
    Operator B / ApprovalService / Scheduler). Same content: the
    same-actor RBAC reject path, the AwaitingApproval gate, the
    audit + Prometheus side effects.

  intermediate-ca-hierarchy.md
    1 lifecycle ASCII → stateDiagram-v2 (created → active → retiring
    → retired with the drain-first refusal annotation).
    3 ASCII tree patterns → 3 flowchart TD diagrams (FedRAMP 4-level
    boundary CA, financial-services 3-level policy CA, internal-PKI
    2-level). Same depth, same path_len + permitted-DNS labels.

  runbook-cloud-targets.md
    1 dual-column ASCII flow → flowchart TD with two subgraphs
    (AWS ACM path, Azure Key Vault path) joining at the audit +
    Prometheus exposer node. Same 6-step deploy sequence on each
    side with the rollback-on-mismatch step explicit.

  runbook-expiry-alerts.md
    1 nested-loop ASCII flow → flowchart TD with three nested
    subgraphs (per-cert main loop / per-threshold inner / per-channel
    fault-isolating dispatch). Same dedup + Prometheus + audit-row
    side effects per channel.

Verified locally:
  Audit re-run: every fenced block in docs/*.md that does NOT open
    with ```mermaid contains zero ASCII box-drawing characters
    (┌ └ │ ─ ━ ═ ║ ╔ ╚ ▼ ▲).
  Mermaid block tally: 39 across 13 files (up from 32 across 9
    files pre-audit). The +7 new blocks are the 4 conversions plus
    the lifecycle + 3 tree patterns expanded out of the single
    intermediate-ca-hierarchy.md ASCII section.

No code or test changes. Doc-only commit.
2026-05-04 02:40:01 +00:00
shankar0123 8908c8ff5c web, docs: IssuerHierarchyPage + sysadmin runbook + connectors row (Rank 8 commit 5)
Final commit of the 5-commit Rank 8 chain. Operator-facing surface
on top of the service + handler layers shipped in commits 1-4.

Frontend (web/src):
  - api/client.ts: 3 new functions + IntermediateCA interface
    (listIntermediateCAs, getIntermediateCA, retireIntermediateCA).
  - pages/IssuerHierarchyPage.tsx: recursive nested <ul> render of
    the hierarchy tree at /issuers/:id/hierarchy. buildHierarchyTree
    is a pure helper that walks the flat list and groups children
    on parent_ca_id; the dendrogram view is parking-lot work tracked
    in WORKSPACE-ROADMAP. Two-phase retire UX surfaces 'Retire…'
    then 'Confirm retire (terminal)' when the row is in retiring
    state. Admin gate is enforced at the API; the page renders the
    backend's 403 as ErrorState for non-admin callers.
  - main.tsx: register the new /issuers/:id/hierarchy route.

CI guard update:
  - scripts/ci-guards/T-1-frontend-page-coverage.sh: add
    IssuerHierarchyPage to the deferred-test allowlist with the
    standard 'why deferred' comment. Admin-gate + recursive build
    semantics are already pinned at the backend layer
    (intermediate_ca_test.go service tests + intermediate_ca_test.go
    handler triplet). Vitest test deferred until next feature
    change touches the page.

Docs:
  - docs/intermediate-ca-hierarchy.md: new operator runbook
    covering:
      Concepts (HierarchyMode 'single' vs 'tree', defense-in-depth
        on key bytes never persisting on rows).
      Lifecycle states + drain-first semantics
        (active → retiring → retired with active-children gate).
      Three deployment patterns: 4-level FedRAMP boundary CA,
        3-level financial-services policy CA, 2-level internal
        PKI.
      RFC 5280 enforcement (§3.2 self-signed, §4.2.1.9 path-length
        tightening, §4.2.1.10 NameConstraints subset).
      Migration from single → tree using the load-bearing
        TestLocal_HierarchyMode_SingleVsTree_ByteIdentical pin as
        the canary.
      API reference + observability (IntermediateCAMetrics
        Prometheus exposure).
      Known limitations + Rank-8 follow-on roadmap.

  - docs/connectors.md: extend the Built-in Local CA section with
    a 'Tree mode (Rank 8)' paragraph describing the new chain
    assembly path + cross-link to docs/intermediate-ca-hierarchy.md.

Roadmap:
  - WORKSPACE-ROADMAP.md: 5 follow-on items under a new
    'Intermediate CA hierarchy extensions (Rank 8 V2 follow-ons)'
    bullet block:
      HSM-backed roots (PKCS#11 / cloud KMS drivers via existing
        signer.Driver interface — no service-layer change needed).
      Automated CA rotation (parallel-validity windows ahead of
        expiry).
      Intra-hierarchy CRL chaining (per-CA CRL endpoints stitched
        at issue time).
      NameConstraints policy templates (FedRAMP / financial /
        internal PKI declarative templates instead of hand-rolled
        JSON).
      D3 dendrogram visualization (separate page so the existing
        list view stays the default + the dep stays opt-in).

Verified locally:
  gofmt: clean.
  go vet ./...: exit 0.
  tsc --noEmit (web/): exit 0 (no TypeScript errors).
  go test -short -count=1 ./internal/api/handler/... + service +
    local: ok across all three packages, 4-5s each.
  All 24 CI guards: clean
    (T-1 frontend-page-coverage with the new
     IssuerHierarchyPage allowlist entry; openapi-handler-parity,
     M-008 admin-gate, every other guard untouched).

Rank 8 chain complete:
  66d2af3  domain, migrations: IntermediateCA type + intermediate_cas
           + Issuer.HierarchyMode (commit 1)
  fb54ebc  service: IntermediateCAService + IntermediateCAMetrics
           + RFC 5280 enforcement (commit 2)
  62523fb  service: 10 IntermediateCAService tests + in-memory fake
           repo (commit 2.5)
  ae597f7  local: tree-mode chain assembly + byte-equivalence pin
           (commit 3 — load-bearing backwards-compat refuse-to-ship
           pin in TestLocal_HierarchyMode_SingleVsTree_ByteIdentical)
  34adcfb  api, handler: 4 admin-gated CA hierarchy endpoints +
           OpenAPI (commit 4)
  HEAD     web, docs: IssuerHierarchyPage + sysadmin runbook +
           connectors row (this commit)

Reference: cowork/rank-8-intermediate-ca-hierarchy-prompt.md, commit 5.
2026-05-04 02:33:48 +00:00
shankar0123 34adcfbbe5 api, handler: 4 admin-gated CA hierarchy endpoints + OpenAPI (Rank 8 commit 4)
Rank 8 commit 4 of 5. The API + RBAC layer that operators drive
the new hierarchy management surface from.

Endpoints (all admin-gated via middleware.IsAdmin; non-admin Bearer
callers get 403):
  POST /api/v1/issuers/{id}/intermediates
       Discriminator on body shape:
         empty parent_ca_id + root_cert_pem + key_driver_id
           → CreateRoot (registers operator-supplied root CA).
         parent_ca_id non-empty
           → CreateChild (signs new sub-CA cert under parent).
       Service-layer error → HTTP code mapping:
         ErrCANotSelfSigned         → 400
         ErrCAKeyMismatch           → 400
         ErrPathLenExceeded         → 400
         ErrNameConstraintExceeded  → 400
         ErrInvalidCertPEM          → 400
         ErrParentCANotActive       → 409
         ErrIntermediateCANotFound  → 404
         (other)                    → 500
  GET  /api/v1/issuers/{id}/intermediates
       Returns flat list ordered by created_at; caller renders the
       tree from each row's parent_ca_id (nil = root).
  GET  /api/v1/intermediates/{id}
       Single-row detail.
  POST /api/v1/intermediates/{id}/retire
       Two-phase: confirm=false → active→retiring; confirm=true →
       retiring→retired with active-children check (drain-first
       semantics; ErrCAStillHasActiveChildren → 409).

Files changed:
  internal/api/handler/intermediate_ca.go            — 4 handlers
                                                       + handler-defined
                                                       service interface
                                                       (dependency
                                                       inversion).
  internal/api/handler/intermediate_ca_test.go       — 8 test variants
                                                       (M-008 admin-
                                                       gate triplet
                                                       complete).
  internal/api/handler/m008_admin_gate_test.go       — register the
                                                       new admin-gated
                                                       handler in
                                                       AdminGatedHandlers
                                                       so the M-008
                                                       coherence
                                                       scanner stays
                                                       green.
  internal/api/router/router.go                      — 4 r.Register
                                                       calls + new
                                                       IntermediateCAs
                                                       field on
                                                       HandlerRegistry.
  cmd/server/main.go                                 — wire the
                                                       postgres repo +
                                                       service +
                                                       handler. Reuses
                                                       the same
                                                       signer.FileDriver
                                                       instance the
                                                       OCSP responder
                                                       bootstrap path
                                                       feeds.
  api/openapi.yaml                                   — 4 new
                                                       operationIds,
                                                       full body
                                                       schema + status-
                                                       code dispatch.

Tests (8 in this commit):
  TestIntermediateCA_Handler_NonAdmin_Returns403       (admin gate
    — table-driven across all 4 endpoints)
  TestIntermediateCA_Handler_AdminExplicitFalse_Returns403
    (defensive: AdminKey present but false ≠ AdminKey absent)
  TestIntermediateCA_Handler_AdminPermitted_ForwardsActor
    (admin actor forwarded to service for audit attribution)
  TestIntermediateCA_HandlerCreate_RootDispatch
    (body discriminator: empty parent_ca_id → CreateRoot)
  TestIntermediateCA_HandlerCreate_ChildDispatch
    (body discriminator: parent_ca_id present → CreateChild)
  TestIntermediateCA_HandlerCreate_BadRequestOnMissingRootBundle
    (validation: no parent + no root bundle → 400)
  TestIntermediateCA_HandlerCreate_ServiceErrorMappings
    (table-driven: 7 service errors → expected HTTP codes)
  TestIntermediateCA_HandlerRetire_TwoPhaseConfirm
    (confirm=false then confirm=true forwarded correctly)
  TestIntermediateCA_HandlerRetire_StillHasActiveChildren_Returns409
    (drain-first contract — 409 not 500)

Verified locally:
  gofmt: clean.
  go vet ./...: exit 0.
  go test -short -count=1 ./internal/api/handler/...: ok 4.498s.
  bash scripts/ci-guards/openapi-handler-parity.sh: clean
    (router routes: 182, openapi operations: 148; the +4 new routes
    have +4 new operationIds — parity preserved).
  bash scripts/ci-guards/* (all 24 guards): clean.

Out of scope of THIS commit (commit 5):
  - web/src/pages/IssuerHierarchyPage.tsx (recursive tree render).
  - docs/intermediate-ca-hierarchy.md sysadmin runbook (FedRAMP /
    financial-services / internal-PKI patterns).
  - docs/connectors.md hierarchy_mode row.
  - WORKSPACE-ROADMAP entries (HSM-backed roots, automated
    rotation, CRL chaining, NameConstraints templates, D3
    dendrogram).

Reference: cowork/rank-8-intermediate-ca-hierarchy-prompt.md, commit 4.
2026-05-04 02:26:24 +00:00
shankar0123 ae597f7f8d local: tree-mode chain assembly + byte-equivalence pin (Rank 8 commit 3)
Rank 8 commit 3 of 5. Load-bearing connector rewrite that activates
the first-class CA hierarchy surface shipped by commits 1-2.

Local connector changes:
  - New ChainAssembler interface (single-method seam) defined in the
    connector package — *service.IntermediateCAService satisfies it
    implicitly. Avoids the import cycle that would arise from
    pulling internal/service into internal/connector/issuer/local.

  - Three new optional fields on Connector: hierarchyMode,
    chainAssembler, treeIssuingCAID. Default zero values keep the
    pre-Rank-8 single-sub-CA flow byte-identical (no operator on
    the historical path sees any change in wire bytes).

  - Three new setters: SetHierarchyMode, SetChainAssembler,
    SetTreeIssuingCAID. Wired in cmd/server/main.go in commit 4
    when the issuer's HierarchyMode column is read at boot.

  - resolveChainPEM helper centralizes the dispatch:
      tree mode + ChainAssembler set + treeIssuingCAID set
        → call AssembleChain over intermediate_cas
      otherwise (incl. tree mode with incomplete wiring)
        → fall back to historical c.caCertPEM
    Defense in depth: a misconfigured operator gets a working
    issuance, not a nil-deref panic.

  - IssueCertificate + RenewCertificate both delegate ChainPEM
    population to resolveChainPEM. The cert generation path
    (generateCertificate) is untouched — same key, same template,
    same signing.

Tests (internal/connector/issuer/local/local_hierarchy_test.go):

  TestLocal_HierarchyMode_SingleVsTree_ByteIdentical ← LOAD-BEARING
    THE refuse-to-ship pin. Two connectors against the same on-disk
    CA cert+key:
      - A: pre-Rank-8 single-sub-CA mode (HierarchyMode unset).
      - B: tree mode wired against an in-memory ChainAssembler
        whose 1-level chain matches A's caCertPEM byte-for-byte.
    Asserts:
      1. resA.ChainPEM == resB.ChainPEM (the byte-identical pin).
      2. resA.ChainPEM == fixture root cert PEM (real fact about
         the wire format, not internal consistency).
    Operators on single mode keep getting byte-identical bytes.
    Operators flipping to tree with a 1-level shim see no change.
    Zero behavioral drift for unmigrated deployments.

  TestLocal_HierarchyMode_Tree_LeafChainIncludesAllAncestors
    Multi-level pin. 4-level synthetic chain (root → policy →
    issuingA → issuingB-leaf-CA). Asserts:
      - 4 CERTIFICATE blocks in ChainPEM.
      - Leaf-first ordering (issuingB.CN, issuingA.CN, policy.CN,
        root.CN at depths 0..3).
    This is what tree mode buys operators in exchange for the
    migration overhead.

  TestLocal_HierarchyMode_FallsBackToSingleWhenWiringIncomplete
    Defensive fallback pin. HierarchyMode='tree' but
    ChainAssembler nil + treeIssuingCAID '' → ChainPEM falls back
    to caCertPEM. No panic, no lying field.

Verified locally:
  gofmt: clean.
  go vet ./...: exit 0.
  go test -short -count=1 -run TestLocal_HierarchyMode ./internal/connector/issuer/local/...
    PASS (3/3, including the load-bearing byte-identical pin).
  go test -short -count=1 ./internal/connector/issuer/local/...: ok 4.358s
    (every existing local-connector test still green — backwards
    compat byte-for-byte at the test layer too).

Out of scope of THIS commit (commit 4):
  - 4 admin-gated handler endpoints + OpenAPI extension.
  - cmd/server/main.go wiring that reads Issuer.HierarchyMode at
    boot and calls SetHierarchyMode + SetChainAssembler +
    SetTreeIssuingCAID on the local connector instance.

Reference: cowork/rank-8-intermediate-ca-hierarchy-prompt.md, commit 3.
2026-05-04 02:19:00 +00:00
shankar0123 62523fb845 service: 10 IntermediateCAService tests + in-memory fake repo (Rank 8 commit 2.5)
Service-layer pin for Rank 8. The fake IntermediateCARepository's
WalkAncestry mirrors the postgres recursive-CTE semantics
(leaf-first ordering, terminate at parent_ca_id IS NULL) so the
AssembleChain pin carries the same weight the production repo would.

Tests:
  TestIntermediateCA_CreateRoot_RegistersOperatorSuppliedSelfSigned
    Happy path. RFC 5280 §3.2 self-signed root + matching key gets
    persisted with parent_ca_id=NULL, state=active, KeyDriverID=...

  TestIntermediateCA_CreateRoot_RejectsNonSelfSigned
    RFC 5280 §3.2 enforcement. Cert whose embedded public key
    doesn't match the actual signer fails CheckSignatureFrom →
    ErrCANotSelfSigned.

  TestIntermediateCA_CreateRoot_RejectsKeyMismatch
    Operator-boundary defense in depth. Cert is well-formed
    self-signed but the supplied keyDriverID resolves to a
    different key → ErrCAKeyMismatch.

  TestIntermediateCA_CreateChild_PathLenTighteningEnforced
    RFC 5280 §4.2.1.9 enforcement. Child whose path-len equals or
    exceeds parent's → ErrPathLenExceeded. Strictly-tighter child
    succeeds.

  TestIntermediateCA_CreateChild_NameConstraintsSubset
    RFC 5280 §4.2.1.10 enforcement. Widening rejected
    ("evil.com" outside parent's "example.com"); subdomain
    narrowing succeeds ("internal.example.com").

  TestIntermediateCA_AssembleChain_4DeepHierarchy ← LOAD-BEARING
    The pin the local connector tree-mode delegates to. Builds
    root → policy → issuing-A → issuing-B and asserts AssembleChain
    returns 4 CERTIFICATE blocks in leaf-to-root order with
    matching subject CommonNames at each depth.

  TestIntermediateCA_Retire_RefusesIfActiveChildren
    Drain-first semantics. retiring → retired with active children
    refuses with ErrCAStillHasActiveChildren.

  TestIntermediateCA_Retire_TwoPhaseConfirm
    First call: active → retiring (no confirm). Second call without
    confirm: surfaces "pass confirm=true". Second call with
    confirm: retiring → retired.

  TestIntermediateCA_MetricsRecordedPerOutcome
    Snapshot pin. CreateRoot bumps create_root, CreateChild bumps
    create_child, Retire(active) bumps retire_retiring, all
    dimensioned by issuer_id.

  TestIntermediateCA_LoadHierarchy_FlatList
    Returns every CA for an issuer ordered by created_at; caller
    renders the tree from parent_ca_id.

Test infrastructure:
  fakeIntermediateCARepo                 — sync.Mutex-guarded map.
                                           WalkAncestry walks
                                           parent_ca_id from leafID
                                           to root (or terminates on
                                           cycle, defense-in-depth).
                                           Compile-time interface
                                           guard.
  testCAFixture                          — mints a self-signed root
                                           cert+key in process,
                                           Adopt()s the key under
                                           a stable ref so CreateRoot
                                           can resolve it.
  newTestService                         — wires IntermediateCAService
                                           with fake repo +
                                           signer.MemoryDriver +
                                           mockAuditRepo (already
                                           lives in testutil_test.go)
                                           + IntermediateCAMetrics.

Verified locally:
  gofmt: clean.
  go vet ./...: exit 0.
  go test -short -count=1 -run TestIntermediateCA ./internal/service/...
    PASS (10/10)
  go test -short -count=1 ./internal/service/...: ok 3.844s

Reference: cowork/rank-8-intermediate-ca-hierarchy-prompt.md, commit 2.5.
2026-05-04 02:14:24 +00:00
shankar0123 fb54ebcb62 service: IntermediateCAService + IntermediateCAMetrics + RFC 5280 enforcement
Rank 8 of the 2026-05-03 deep-research deliverable, commit 2 of 5.
Service-layer wiring for first-class N-level CA hierarchy management.
The connector rewrite that activates this surface lands in commit 3.

Files added:
  internal/service/intermediate_ca.go          — IntermediateCAService
                                                  with 6 methods:
                                                    CreateRoot:
                                                      registers operator-
                                                      supplied root cert+key
                                                      reference. Validates
                                                      RFC 5280 §3.2 self-
                                                      signed (subject ==
                                                      issuer + signature
                                                      verifies). Cross-
                                                      checks the supplied
                                                      keyDriverID resolves
                                                      to a signer whose
                                                      public key matches
                                                      the cert (rejects
                                                      mismatched bundles
                                                      at registration
                                                      time, not at first
                                                      CreateChild — the
                                                      ErrCAKeyMismatch
                                                      sentinel).
                                                    CreateChild:
                                                      generates child key
                                                      via signer.Driver,
                                                      signs the cert via
                                                      the parent's signer.
                                                      Enforces RFC 5280
                                                      §4.2.1.9 (path-len
                                                      tightening) +
                                                      §4.2.1.10
                                                      (NameConstraints
                                                      subset semantics) at
                                                      service layer fail-
                                                      closed. Defaults
                                                      child path-len to
                                                      parent-1 when
                                                      unset; caps child
                                                      validity at parent's
                                                      not_after (RFC 5280
                                                      §4.1.2.5).
                                                    Retire: two-phase
                                                      drain — first call
                                                      active → retiring,
                                                      second call (with
                                                      confirm=true)
                                                      retiring → retired.
                                                      Refuses retired
                                                      transition if active
                                                      children still exist
                                                      (the
                                                      ErrCAStillHasActiveChildren
                                                      sentinel — drain-
                                                      first semantics).
                                                    Get / LoadHierarchy:
                                                      thin repo wrappers.
                                                    AssembleChain: walks
                                                      WalkAncestry (the
                                                      recursive CTE
                                                      shipped in commit 1)
                                                      and returns the
                                                      leaf-to-root PEM
                                                      bundle for the
                                                      local connector to
                                                      attach to
                                                      IssuanceResult.

  internal/service/intermediate_ca_metrics.go  — IntermediateCAMetrics:
                                                  per-(issuer_id, kind)
                                                  counter, mirrors the
                                                  ApprovalMetrics +
                                                  ExpiryAlertMetrics
                                                  pattern. RecordCreate
                                                  (root/child) +
                                                  RecordRetire
                                                  (retiring/retired).
                                                  SnapshotIntermediateCA
                                                  for the Prometheus
                                                  exposer.

Defense in depth retained:
  - NEVER persist CA private key bytes in the row. KeyDriverID is the
    only key reference; signer.Driver.Load resolves it at signing time.
  - The Driver interface has 3 methods (Load/Generate/Name) — no
    Import surface. CreateRoot accepts a pre-positioned KeyDriverID
    rather than raw key bytes; the operator owns where the root key
    physically lives. Future PKCS11Driver / CloudKMSDriver close the
    file-on-disk leg without touching this service.

Verified locally:
  gofmt: clean.
  go vet ./internal/service/...: exit 0.
  go build ./internal/service/...: exit 0.

Deferred to commit 2.5 (or fold into commit 3, operator's call):
  - 9 service-level tests including:
    * TestIntermediateCA_CreateRoot_RegistersOperatorSuppliedSelfSigned
    * TestIntermediateCA_CreateRoot_RejectsNonSelfSigned
    * TestIntermediateCA_CreateRoot_RejectsKeyMismatch
    * TestIntermediateCA_CreateChild_PathLenTighteningEnforced
    * TestIntermediateCA_CreateChild_NameConstraintsSubset
    * TestIntermediateCA_AssembleChain_4DeepHierarchy ← LOAD-BEARING
    * TestIntermediateCA_Retire_RefusesIfActiveChildren
    * TestIntermediateCA_Retire_TwoPhaseConfirm
    * TestIntermediateCA_MetricsRecordedPerOutcome

  Test setup needs: in-memory IntermediateCARepository fake +
  signer.MemoryDriver (already exists) + helper to generate test root
  cert+key. Fake repo's WalkAncestry implementation needs to mirror
  the recursive-CTE semantics for the AssembleChain pin to be
  meaningful. Total ~500 lines of test code; non-trivial setup.

Out of scope of THIS commit (commits 3-5):
  - Local connector rewrite + byte-equivalence pin
    (TestLocal_HierarchyMode_SingleVsTree_ByteIdentical).
  - 4 admin-gated handler endpoints + OpenAPI extension.
  - web/src/pages/IssuerHierarchyPage.tsx.
  - docs/intermediate-ca-hierarchy.md sysadmin runbook.
  - cmd/server/main.go wiring.

Reference: cowork/rank-8-intermediate-ca-hierarchy-prompt.md.
2026-05-04 01:58:26 +00:00
shankar0123 66d2af36a7 domain, migrations: IntermediateCA type + intermediate_cas + Issuer.HierarchyMode
Rank 8 of the 2026-05-03 deep-research deliverable, commit 1 of 5
(cowork/rank-8-intermediate-ca-hierarchy-prompt.md). Closes the multi-
level CA hierarchy gap for FedRAMP boundary-CA, financial-services
policy-CA, and OT network-CA deployments where regulator-mandated
certificate-policy separation requires multiple layers (root → policy
→ issuing).

This commit lands ONLY the foundation — schema, types, repository
interface, postgres implementation. No service / connector / handler
wiring yet. The 5-commit chain is bisectable: this commit can ship
with no operator-visible behavior change until commits 2-5 wire the
service layer + the local-connector tree-mode + admin API + GUI tree
view + operator runbook. The default value for issuers.hierarchy_mode
is 'single' so every existing operator's behavior is byte-identical
post-migration.

Existing scaffolding REUSED (not redefined):
  - internal/crypto/signer.Driver seam — every IntermediateCA carries
    a key_driver_id pointing at the signer.Driver instance that owns
    its private key. Defense in depth: NEVER persist key bytes in a
    row. FileDriver is the production default; future PKCS11Driver /
    CloudKMSDriver close the disk-exposure leg via the same seam.
  - issuers.id row — the new intermediate_cas FK references it.

Files added:
  internal/domain/intermediate_ca.go              — IntermediateCA type,
                                                     IntermediateCAState
                                                     closed enum (active /
                                                     retiring / retired),
                                                     IsValidIntermediateCAState
                                                     + IsTerminal helpers,
                                                     NameConstraint struct
                                                     (RFC 5280 §4.2.1.10
                                                     permitted+excluded
                                                     subtree subset
                                                     semantics for service-
                                                     layer enforcement),
                                                     HierarchyModeSingle /
                                                     HierarchyModeTree
                                                     constants.
  internal/repository/postgres/intermediate_ca.go — IntermediateCARepository
                                                     impl: Create (ica-<slug>
                                                     ID gen, JSONB +
                                                     nullable-column round-
                                                     trip, lib/pq 23505 →
                                                     ErrAlreadyExists),
                                                     Get, ListByIssuer,
                                                     ListChildren,
                                                     UpdateState,
                                                     GetActiveRoot,
                                                     WalkAncestry (recursive
                                                     CTE — single SQL
                                                     round-trip, O(depth)
                                                     rows, leaf-first
                                                     ordering).
  migrations/000028_intermediate_ca_hierarchy.{up,down}.sql
                                                  — idempotent schema.
                                                     issuers.hierarchy_mode
                                                     VARCHAR(20) DEFAULT
                                                     'single'. New
                                                     intermediate_cas table
                                                     with FKs to
                                                     issuers / self
                                                     (parent_ca_id) +
                                                     CHECK constraints
                                                     (closed-enum state,
                                                     not_after >
                                                     not_before, no self-
                                                     parent) + 6 indexes
                                                     (partial-unique
                                                     active root per
                                                     issuer, partial-
                                                     unique name per
                                                     issuer, owning
                                                     issuer, parent,
                                                     state, expiring).

Files modified:
  internal/domain/connector.go      — adds Issuer.HierarchyMode field
                                       with full doc comment + JSON tag.
                                       Empty string ≡ single mode for
                                       back-compat.
  internal/repository/interfaces.go — adds IntermediateCARepository
                                       interface (7 methods).

Verified locally:
  gofmt: clean.
  go vet ./internal/domain/... ./internal/repository/...: exit 0.
  go build ./internal/domain/... ./internal/repository/...: exit 0.

Out of scope for this commit (lands in commits 2-5):
  - service/intermediate_ca.go (CreateRoot / CreateChild / Retire /
    LoadHierarchy / AssembleChain + RFC 5280 §4.2.1.9 path-len +
    §4.2.1.10 NameConstraints subset enforcement + 9 service tests).
  - local connector rewrite + byte-equivalence pin
    (TestLocal_HierarchyMode_SingleVsTree_ByteIdentical — the load-
    bearing backwards-compat refusal-to-ship test).
  - 4 admin-gated handler endpoints + OpenAPI extension + handler tests.
  - web/src/pages/IssuerHierarchyPage.tsx.
  - docs/intermediate-ca-hierarchy.md sysadmin runbook + connectors.md
    row + WORKSPACE-ROADMAP follow-ons.

Reference: cowork/rank-8-intermediate-ca-hierarchy-prompt.md.
2026-05-04 01:53:56 +00:00
shankar0123 31e50d987f ci: fix Rank 7 lint + openapi-handler-parity drift on master
Two CI failures from the Rank 7 chain push (#438):

  Go Build & Test — staticcheck ST1021:
    internal/service/approval_metrics.go:97  comment for ApprovalDecisionEntry
                                               doesn't start with the type name
    internal/service/approval_metrics.go:130 comment for ApprovalPendingAgeSnapshot
                                               doesn't start with the type name

  Frontend Build — scripts/ci-guards/openapi-handler-parity.sh:
    4 router routes have no OpenAPI operationId:
      GET    /api/v1/approvals
      GET    /api/v1/approvals/{id}
      POST   /api/v1/approvals/{id}/approve
      POST   /api/v1/approvals/{id}/reject
    The Rank 7 commit-3 spec deferred OpenAPI extension to commit 4 with a
    'batched alongside the integration changes' note; commit 4 didn't actually
    add them. This commit closes that gap.

Fixes:

  approval_metrics.go — split the doc comment that was attached to
    SnapshotApprovalDecisions (the function) but visually preceded
    ApprovalDecisionEntry (the type), so the type appeared to staticcheck
    as having a comment that named the function instead of the type.
    Same fix on ApprovalPendingAgeSnapshot. Now each exported type has its
    own type-name-leading comment per Go convention.

  api/openapi.yaml — added 4 new operationIds (listApprovalRequests,
    getApprovalRequest, approveApprovalRequest, rejectApprovalRequest)
    + new ApprovalRequest schema component under components/schemas.
    Inline 401 response (the Unauthorized component does not exist in
    this spec; the canonical pattern in the rest of the file is inline
    'description: Authentication required'). The two-person integrity
    contract surface is documented in the description of the approve /
    reject endpoints so external readers see the RBAC contract from the
    spec alone.

Verified locally:
  go vet ./internal/service/...:                      exit 0.
  scripts/ci-guards/openapi-handler-parity.sh:        clean (140 ops vs 174 routes,
                                                       36 documented exceptions).

Third CI failure (image-and-supply-chain) was a transient apt-fetch
'Connection reset by peer' from deb.debian.org while pulling
libasan6_10.2.1-6_amd64.deb. Not a code issue; just re-run the workflow.
No code change needed.
2026-05-04 01:35:30 +00:00
shankar0123 b601928e1c docs(approval-workflow): drop Infisical reference from operator playbook
The operator-facing approval-workflow.md is the public-readable docs
page; the 'Infisical deep-research deliverable' framing is internal
project context that doesn't belong there. Internal source comments +
research docs in cowork/ keep the original framing as the historical
record.
2026-05-04 01:18:59 +00:00
shankar0123 aebfd8bd7c Revert "chore: drop 'Infisical' label from internal references"
This reverts commit 19706e56b3.
2026-05-04 01:18:15 +00:00
shankar0123 19706e56b3 chore: drop 'Infisical' label from internal references
Strategic naming cleanup. Earlier doc-comments + commit messages framed Rank
4 / Rank 5 / Rank 7 work as 'Rank N of the 2026-05-03 Infisical deep-research
deliverable' — the 'Infisical' qualifier was a holdover from the original
deep-research framing where Infisical (a competing secrets-management
platform) was the comparator. Keeping the comparator's name in our source
adds noise without value; an external reader sees 'Infisical' and assumes a
dependency or shared lineage rather than reading it as the competitive
context it was.

Mechanical sed across 34 files (32 source / docs + 2 follow-up Python passes
to collapse 'deep-research deep-research' duplicates that emerged where the
original phrase wrapped across lines):

  s|Infisical deep-research|deep-research|g
  s|infisical-deep-research-results|deep-research-results-2026-05-03|g
  s|infisical-deep-research-prompt|deep-research-prompt-2026-05-03|g
  s|infisical-deep-research|deep-research|g
  s|Infisical|deep-research|g
  s|deep-research deep-research|deep-research|g  # collapse-pass

Net diff: 63 insertions / 64 deletions across cmd/, docs/, internal/,
migrations/. Pure text substitution; zero behavior change. Code path
unchanged — go vet clean, tests for TestApproval pass on both
internal/service and internal/api/handler packages.

Workspace docs (cowork/) carry the same references and will be swept
separately — they're not under certctl/ git control. The two filename
references (cowork/infisical-deep-research-results.md +
cowork/infisical-deep-research-prompt.md) get renamed alongside that sweep
to deep-research-results-2026-05-03.md /
deep-research-prompt-2026-05-03.md so cross-references in the certctl
repo doc-comments resolve cleanly.
2026-05-04 01:15:01 +00:00
shankar0123 03c61f4c20 scheduler, certificate, renewal: gate issuance on profile-driven approval
Closes Rank 7 of the 2026-05-03 Infisical deep-research deliverable
(cowork/infisical-deep-research-results.md Part 5). Pre-fix, certctl
issued certificates unattended — every renewal-loop tick that crossed
a renewal threshold created a Job at Status=Pending which the
scheduler dispatched directly to the issuer connector. PCI-DSS Level
1, FedRAMP Moderate / High, SOC 2 Type II, and HIPAA-regulated PHI
customers all ask the same procurement question: "How do you enforce
two-person integrity on cert issuance?" Today's answer: "We don't."
After this commit chain: "Per-profile RequiresApproval=true creates a
parallel ApprovalRequest row; the renewal-loop creates the Job at
Status=AwaitingApproval; an authorized approver (different from the
requester per the same-actor RBAC check) calls
POST /api/v1/approvals/{id}/approve, transitioning the Job to
Pending; the scheduler picks it up."

This commit (4 of 4) wires the gate into the manual TriggerRenewal
entry point + main.go service construction + Config.Approval +
docs + WORKSPACE-ROADMAP follow-up entries. The previous commits
in the chain shipped:
  - 1 (2025275): domain types + migration + repository
  - 2 (8043e2b): ApprovalService + ApprovalMetrics + 8 service tests
  - 3 (81632eb): 4 API endpoints + handler RBAC tests + router wiring

Files modified:
  cmd/server/main.go              - Constructs approvalRepo +
                                     approvalMetrics + approvalService
                                     + approvalHandler. Wires
                                     CertificateService via
                                     SetApprovalService + SetProfileRepo.
                                     Logs a WARN line at boot when
                                     CERTCTL_APPROVAL_BYPASS=true so
                                     production operators alert on the
                                     log line. Adds Approvals to the
                                     HandlerRegistry.

  internal/config/config.go       - Adds top-level ApprovalConfig
                                     {BypassEnabled bool} sub-config
                                     + CERTCTL_APPROVAL_BYPASS env var
                                     loader. Doc comment cites the
                                     compliance-detection SQL query
                                     (SELECT count FROM audit_events
                                     WHERE actor='system-bypass') so
                                     auditors find the right pattern.

  internal/service/certificate.go - Adds approvalSvc + profileRepo
                                     fields to CertificateService +
                                     SetApprovalService /
                                     SetProfileRepo setters. Extends
                                     TriggerRenewal: looks up the
                                     profile, checks RequiresApproval,
                                     creates the Job at
                                     JobStatusAwaitingApproval (override
                                     the keygen-mode default), then
                                     calls approvalSvc.RequestApproval
                                     to create the parallel
                                     ApprovalRequest row. On
                                     RequestApproval failure, cancels
                                     the orphan Job (defense in depth —
                                     without this, a partial failure
                                     would leave the job stuck at
                                     AwaitingApproval forever). Profile-
                                     lookup failures fall back to the
                                     unattended path (fail-open from
                                     the operator's perspective +
                                     fail-loud via slog.Warn).

Files added:
  docs/approval-workflow.md       - Sysadmin-grade operator runbook:
                                      end-to-end ASCII flowchart
                                      (operator A triggers → operator
                                      B approves → scheduler dispatches),
                                      configuration recipe, RBAC contract
                                      (the load-bearing two-person
                                      integrity rule), operator playbooks
                                      for "I need to approve a renewal"
                                      and "approval timed out", PCI-DSS
                                      6.4.5 / NIST 800-53 SA-15 / SOC 2
                                      CC6.1 / HIPAA control mapping
                                      table, bypass-mode warnings with
                                      the exact compliance-detection SQL
                                      query, Prometheus metric reference,
                                      future free V2 work pointers.

Out of scope of THIS commit (deferred follow-on, not blocking the rest):
  - RenewalService.CheckExpiringCertificates auto-renewal-loop gate.
    The manual TriggerRenewal entry point is gated and the job-level
    timeout reaper already covers AwaitingApproval; the auto-renewal
    gate adds parity. Trivial to add — one block in renewal.go that
    mirrors the certificate.go::TriggerRenewal gate. Tracked in
    WORKSPACE-ROADMAP under the Approval-workflow extensions section.
  - Scheduler reaper extension calling ApprovalService.ExpireStale.
    Today: when the existing reaper times out an AwaitingApproval job,
    the parallel ApprovalRequest row stays at state=pending. The audit
    timeline is still correct (the job-side audit row records the
    timeout) but the dashboard shows a row that no longer needs human
    review. Trivial to wire — one method call in the existing
    scheduler tick. Same WORKSPACE-ROADMAP follow-on.
  - api/openapi.yaml extensions for the 4 new operationIds.
    The HTTP contract is pinned by the handler-level tests; OpenAPI
    is documentation that mirrors the contract.
  - docs/connectors.md `requires_approval` row in the CertificateProfile
    config table. Tracked in the same follow-on; the new
    docs/approval-workflow.md is the canonical reference.

Workspace-level updates (in cowork/, not under certctl/ git control —
applied separately):
  WORKSPACE-ROADMAP.md            - "Approval-workflow extensions"
                                     section under "Future Free V2 Work"
                                     covering M-of-N chains + time-
                                     windowed auto-approve + external
                                     ticketing + per-owner routing +
                                     delegation. All items free under
                                     BSL — no V3-Pro framing per the
                                     2026-05-03 strategy pivot (open
                                     core under BSL; future revenue =
                                     managed-service hosting).

Verified locally:
  gofmt: clean.
  go vet ./...: exit 0.
  go build ./...: exit 0 — full repo links cleanly with the new
    Approval wiring.
  go test -short -count=1 -run TestApproval
    ./internal/service/... ./internal/api/handler/...:
    ok 0.005s for both packages — all 11 approval tests green
    (8 service-level + 3 handler-level).

Reference: cowork/rank-7-approval-workflow-primitive-prompt.md.
Commits: 20252758043e2b81632eb → THIS COMMIT.
2026-05-04 01:12:07 +00:00
shankar0123 81632eb0f3 api, handler: 4 approval endpoints + handler RBAC integration tests
Rank 7 of the 2026-05-03 Infisical deep-research deliverable, commit 3 of 4.
Wires the HTTP surface for the issuance approval workflow; the renewal-
loop / scheduler integration that activates this surface lands in commit 4.

Files added:
  internal/api/handler/approval.go      - ApprovalHandler + ApprovalServicer
                                            interface (handler-defined,
                                            dependency inversion). 4
                                            endpoints:
                                              GET  /api/v1/approvals
                                                ?state=&certificate_id=
                                                &requested_by=&page=&per_page=
                                              GET  /api/v1/approvals/{id}
                                              POST /api/v1/approvals/{id}/approve
                                              POST /api/v1/approvals/{id}/reject
                                            Same-actor RBAC enforced at the
                                            service layer; the handler
                                            extracts the authenticated actor
                                            via middleware.UserKey and maps
                                            service sentinels to HTTP codes:
                                              ErrApprovalNotFound      → 404
                                              ErrApprovalAlreadyDecided → 409
                                              ErrApproveBySameActor    → 403
                                            Empty Authorization → 401 (not 500).
                                            Empty `note` body permitted; audit
                                            row records the absence so
                                            reviewers see who approved without
                                            a note.

  internal/api/handler/approval_test.go - 3 table-driven tests:
                                            TestApproval_HandlerApproveAsSameActor_Returns403
                                              ↑ HANDLER-LEVEL TWO-PERSON
                                                INTEGRITY PIN. Pairs with
                                                the service-level
                                                TestApproval_Approve_RejectsSameActor.
                                                Compliance auditors expect
                                                exactly HTTP 403 (not 401,
                                                not 500) when the requester
                                                self-approves; the test
                                                additionally asserts the
                                                error body contains the
                                                "two-person integrity"
                                                substring so an auditor can
                                                grep server logs for
                                                attempted self-approvals.
                                            TestApproval_HandlerEmptyNote_Allowed_DecidedByExtractedFromAuth
                                              ↑ pins that decided_by comes
                                                from the auth-middleware
                                                UserKey, NEVER from the
                                                request body. Defends
                                                against future contributor
                                                confusion that might let a
                                                client supply their own
                                                decided_by string.
                                            TestApproval_HandlerErrorMapping
                                              (NotFound → 404, AlreadyDecided
                                              → 409 subtests).

Files modified:
  internal/api/router/router.go         - Adds Approvals field to
                                            HandlerRegistry struct + 4
                                            r.Register lines for the
                                            approval routes. Go 1.22
                                            ServeMux precedence: literal
                                            /approve and /reject segments
                                            resolve before the {id}
                                            pattern-var route, mirroring
                                            the existing notifications
                                            block's /requeue precedence.

Verified:
  gofmt: clean.
  go vet ./internal/api/... ./internal/service/...: exit 0.
  go test -short -count=1 -run TestApproval
    ./internal/api/handler/...: ok 0.004s.

Note on OpenAPI spec: the prompt's spec section also calls for 5 new
operationIds in api/openapi.yaml (createApprovalRequest, listApprovalRequests,
getApprovalRequest, approveApprovalRequest, rejectApprovalRequest). The
external-create endpoint is intentionally not implemented in V2 — every
approval request originates from the renewal-loop entry points (commit 4)
so the only operations exposed are list / get / approve / reject. The
4-route surface is a deliberate scope cut: external systems wanting to
inject approval requests can use the underlying `POST /api/v1/certificates/
{id}/renew` path which creates the parallel ApprovalRequest as a side
effect (post-commit-4 wiring). OpenAPI extension batched into commit 4
alongside the integration changes.

Out of scope for this commit (lands in commit 4):
  - Integration into CertificateService.TriggerRenewal +
    RenewalService.CheckExpiringCertificates + Scheduler.ReapTimedOutJobs.
  - cmd/server/main.go wiring.
  - Config.Approval.BypassEnabled + CERTCTL_APPROVAL_BYPASS env var.
  - api/openapi.yaml extensions.
  - docs/connectors.md + docs/approval-workflow.md.

Reference: cowork/rank-7-approval-workflow-primitive-prompt.md.
2026-05-04 01:05:16 +00:00
shankar0123 8043e2bbac service: ApprovalService + ApprovalMetrics + 8 table-driven tests
Rank 7 of the 2026-05-03 Infisical deep-research deliverable, commit 2 of 4
(cowork/rank-7-approval-workflow-primitive-prompt.md). Builds on the
foundation in commit 2025275 — wires the service layer that drives the
approval workflow. Still no handler / integration wiring; commits 3-4
land that.

Files added:
  internal/service/approval.go         - ApprovalService struct + 6
                                          methods: RequestApproval,
                                          Approve, Reject, ListPending,
                                          List, Get, ExpireStale.
                                          Same-actor RBAC check
                                          (ErrApproveBySameActor) at
                                          both Approve and Reject; the
                                          load-bearing two-person
                                          integrity gate. Bypass mode
                                          short-circuits via
                                          approveInternal(outcome=
                                          "bypassed", actorType=System).
                                          Audit + metric emission per
                                          decision via shared
                                          recordAudit helper. Tolerates
                                          nil AuditService for tests.
                                          Service depends on a narrow
                                          JobStatusUpdater interface
                                          (single-method) rather than
                                          the full repository.JobRepository
                                          — production wiring satisfies
                                          it implicitly via postgres'
                                          existing UpdateStatus.

  internal/service/approval_metrics.go - ApprovalMetrics: thread-safe
                                          counter table (decisions
                                          counter dimensioned by
                                          outcome × profile_id) + a
                                          custom durationHistogram for
                                          pending-age (le buckets:
                                          60, 300, 1800, 3600, 21600,
                                          86400, +Inf — 1m, 5m, 30m,
                                          1h, 6h, 24h, beyond).
                                          Snapshot* methods return the
                                          Prometheus exposer's input
                                          shapes. Mirrors the
                                          ExpiryAlertMetrics +
                                          VaultRenewalMetrics pattern
                                          from prior ranks.

  internal/service/approval_test.go    - 8 table-driven tests with
                                          tight in-package fakes
                                          (fakeApprovalRepo +
                                          fakeJobStateRepo):
                                            TestApproval_RequestCreatesPendingRow_BypassDisabled
                                            TestApproval_BypassMode_AutoApprovesWithSystemBypassActor
                                            TestApproval_Approve_TransitionsJobFromAwaitingApprovalToPending
                                            TestApproval_Reject_TransitionsJobFromAwaitingApprovalToCancelled
                                            TestApproval_Approve_RejectsSameActor
                                              ↑ THE LOAD-BEARING TWO-PERSON
                                                INTEGRITY TEST. PCI-DSS 6.4.5
                                                / NIST 800-53 SA-15 / SOC 2
                                                CC6.1 compliance auditors
                                                pattern-match against this.
                                                Pins same-actor rejection on
                                                both Approve and Reject paths;
                                                pins success when a different
                                                actor approves.
                                            TestApproval_Approve_RejectsAlreadyDecided
                                            TestApproval_ExpireStale_TransitionsPendingToExpired_AndCancelsJob
                                            TestApproval_MetricCounterIncrements

Verified:
  gofmt: clean.
  go vet ./internal/service/...: exit 0.
  go test -short -count=1 -run TestApproval ./internal/service/...:
    ok 0.005s — all 8 tests green.

Out of scope for this commit (lands in commits 3-4):
  - api/handler/approval.go (5 endpoints + handler-side RBAC).
  - api/openapi.yaml extensions.
  - Integration into CertificateService.TriggerRenewal +
    RenewalService.CheckExpiringCertificates + Scheduler.ReapTimedOutJobs.
  - cmd/server/main.go wiring of ApprovalService + ApprovalMetrics.
  - Config.Approval.BypassEnabled + CERTCTL_APPROVAL_BYPASS env var.
  - docs/connectors.md row + docs/approval-workflow.md runbook.

Reference: cowork/rank-7-approval-workflow-primitive-prompt.md.
2026-05-04 01:01:53 +00:00
shankar0123 2025275b43 domain, migrations: ApprovalRequest type + issuance_approval_requests + RequiresApproval
Rank 7 of the 2026-05-03 Infisical deep-research deliverable, commit 1 of 4
(cowork/rank-7-approval-workflow-primitive-prompt.md). The four-commit
chain ships the issuance approval-workflow primitive (request → human review
→ CA call) closing the two-person integrity / four-eyes principle
procurement gap for PCI-DSS Level 1, FedRAMP Moderate / High, SOC 2
Type II, and HIPAA-regulated PHI deployments.

This commit lands ONLY the foundation — schema, types, repository
interface, postgres implementation. No service / handler wiring yet.
The four-commit shape is bisectable: the schema can land in production
behind a flag (via the default RequiresApproval=false on every existing
profile) without any operator-visible behavior change until commits 2-4
wire the surrounding workflow.

Existing scaffolding REUSED (not redefined here):
  - JobStatusAwaitingApproval enum value (internal/domain/job.go).
  - JobRepository.ListTimedOutAwaitingJobs (postgres reaper query).
  - Config.Scheduler.AwaitingApprovalTimeout (env-mapped via
    CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT, default 168h = 7 days).
  - Scheduler.SetAwaitingApprovalTimeout wiring.

Files added:
  internal/domain/approval.go              - ApprovalRequest type,
                                              ApprovalState closed enum
                                              (pending/approved/rejected/
                                              expired), IsValidApprovalState +
                                              IsTerminal helpers, outcome
                                              const block + bypass-actor
                                              sentinel.
  internal/repository/postgres/approval.go - ApprovalRepository
                                              implementation: Create
                                              (ar-<slug> ID gen + JSONB
                                              metadata round-trip + lib/pq
                                              23505 → ErrAlreadyExists
                                              translation), Get, GetByJobID,
                                              List (paginated with state /
                                              cert / requester filters),
                                              UpdateState (pending→terminal
                                              transitions only, with
                                              already-terminal disambiguation),
                                              ExpireStale (bulk reaper,
                                              decided_by='system-reaper').
  migrations/000027_approval_workflow.{up,down}.sql
                                            - Idempotent IF NOT EXISTS /
                                              IF EXISTS. Adds
                                              certificate_profiles.requires_approval
                                              BOOLEAN NOT NULL DEFAULT false,
                                              issuance_approval_requests
                                              table with FK to
                                              managed_certificates / jobs /
                                              certificate_profiles, four
                                              indexes (state, certificate,
                                              pending-age, partial-unique
                                              pending-per-job), and the
                                              approval_decision_consistency
                                              CHECK constraint enforcing
                                              decided_by/decided_at must be
                                              non-null for terminal states.

Files modified:
  internal/domain/profile.go               - Adds CertificateProfile.RequiresApproval
                                              bool field with full doc
                                              comment + JSON tag. Defaults
                                              to false (back-compat — every
                                              existing profile keeps the
                                              unattended renewal path).
  internal/repository/interfaces.go        - Adds ApprovalRepository
                                              interface (6 methods) +
                                              ApprovalFilter struct.
  internal/repository/errors.go            - Adds ErrAlreadyExists sentinel
                                              for postgres SQLSTATE 23505
                                              (unique-constraint violations
                                              from the partial-unique
                                              pending-per-job index, plus
                                              the "already terminal" state-
                                              transition signal). Mirrors
                                              the existing ErrNotFound +
                                              ErrForeignKeyConstraint shape.

Verified:
  gofmt: clean.
  go vet ./internal/domain/... ./internal/repository/...: exit 0.
  go build ./internal/domain/... ./internal/repository/...: exit 0.

Out of scope for this commit (lands in commits 2-4):
  - service/approval.go (RequestApproval / Approve / Reject / ListPending
    / ExpireStale + same-actor RBAC + bypass mode + audit + metrics).
  - service/approval_metrics.go (decisions counter + pending-age histogram).
  - 8 service-level table-driven tests including the load-bearing
    TestApproval_Approve_RejectsSameActor two-person integrity pin.
  - api/handler/approval.go (5 endpoints + RBAC integration).
  - api/openapi.yaml (5 new operationIds).
  - Integration into CertificateService.TriggerRenewal +
    RenewalService.CheckExpiringCertificates + Scheduler.ReapTimedOutJobs.
  - cmd/server/main.go wiring.
  - Config.Approval.BypassEnabled + CERTCTL_APPROVAL_BYPASS env var.
  - docs/connectors.md CertificateProfile config-table row.
  - docs/approval-workflow.md operator playbook + compliance control mapping.

Reference: cowork/infisical-deep-research-results.md Part 5 Rank 7.
Acquisition prompt: cowork/rank-7-approval-workflow-primitive-prompt.md.
2026-05-04 00:55:17 +00:00
shankar0123 69d4ada385 ci(release): pin run-name + release title to tag (fix ugly auto-generated titles)
Two GitHub-Actions defaults were producing ugly titles on every tag:

1. The Actions-tab workflow run title was auto-generated as
   `<commit-subject> #<run-number>` because release.yml had no `run-name:`.
   The v2.0.69 push showed up as
   "chore: rename Go module path to github.com/certctl-io/certctl #73"
   instead of the obvious "Release v2.0.69".

2. The Releases-page title was auto-generated by
   softprops/action-gh-release@v2 because the action's `with:` block had
   no `name:` field — it falls back to the most recent commit subject in
   that case, producing the same noise on the Releases page.

Fixes:
- Add `run-name: Release ${{ github.ref_name }}` at the workflow top.
  `github.ref_name` resolves to the tag (e.g., `v2.0.69`) since the only
  trigger is `on: push: tags: ['v*']`. Actions tab now shows
  "Release v2.0.69".
- Add `name: ${{ github.ref_name }}` to the softprops/action-gh-release@v2
  step's `with:` block. Releases page now shows "v2.0.69" as the title
  instead of the commit subject.

Affects v2.0.70+. The v2.0.69 workflow run + release page that's already
in flight retain the bad titles (the workflow file is read at trigger
time); the v2.0.69 Releases-page title can be manually edited via the
GitHub UI ("Edit release" → set title to `v2.0.69` → Update release).
The Actions-tab run name for #73 is immutable post-trigger.

This same pattern likely affects ci.yml + the other workflows but the
operator-facing surface is the Release workflow's titles, so leaving
the CI workflows alone for now (they run continuously on master and
nobody clicks individual run titles).
2026-05-04 00:46:31 +00:00
shankar0123 8b75e0311b chore: rename Go module path to github.com/certctl-io/certctl
Mechanical sed across the main go.mod's module declaration, the f5-mock-icontrol
sub-module's go.mod, every Go file's import path (361 files), and a rebuild of
the checked-in f5-mock-icontrol binary so its embedded build-info reflects the
new module path. No behavior change.

Choice B from cowork/transfer-certctl-to-org.md, executed 2026-05-04. Choice A
(keep module path declared as github.com/shankar0123/certctl regardless of
repo URL) shipped on the day of the org transfer (2026-05-03) since we had no
external Go consumers; this commit closes that deferral.

Backward-compat: GitHub HTTP redirects continue to forward
github.com/shankar0123/certctl → github.com/certctl-io/certctl at the URL
level, but Go's module proxy uses the path declared in go.mod as the
canonical name. Pre-fix, anyone trying `go get github.com/certctl-io/certctl/...`
hit a "module path mismatch" error because go.mod said
github.com/shankar0123/certctl and the URL they fetched it from said
certctl-io/certctl. Post-fix, the canonical name and the URL agree, so
go get / go install / external Go consumers / Go-tooling integrations
work cleanly via either the new path (preferred) or the old path (which
redirects and Go follows the redirect for source fetch).

Anyone still importing the old path inside their own code keeps working
provided they update their go.mod's `require` line to match — the module
path declared in their consumer's go.sum / go.mod is the authoritative
import name, so a mass sed across their import statements is the migration
on the consumer side. No external consumers exist today.

Diff shape:
  361 *.go files  — import path replacement only
    2 go.mod     — module declaration replacement only
    1 binary     — deploy/test/f5-mock-icontrol/f5-mock-icontrol rebuilt
                   so embedded build-info reflects the new path (8618965 vs
                   8618933 bytes; 32-byte diff is the build-info change)

  Total: 364 files, 730 insertions / 730 deletions, net-zero size, pure
  mechanical substitution.

Verification:
  gofmt: 17 files needed re-alignment after sed (the new path is one char
    shorter than the old, so column-aligned import groups drifted). Applied
    `gofmt -w` to fix.
  go mod tidy: clean exit on both modules.
  go vet ./...: clean exit.
  go build ./...: clean exit.
  go test -short -count=1 on representative packages: all green
    (internal/domain, internal/validation, internal/crypto, internal/crypto/signer,
    cmd/agent). Test output now reads `ok github.com/certctl-io/certctl/...`
    confirming the module path resolves correctly.
  binary: f5-mock-icontrol rebuilt; `strings | grep shankar0123` returns
    nothing; `strings | grep certctl-io/certctl` shows the new module path
    embedded in build-info.

Files intentionally NOT touched in this commit:
  README.md / CHANGELOG.md / docs/ / etc. — already swept to certctl-io
    URLs in commit 0729ee4 (the post-transfer URL refresh). This commit is
    purely the Go-tooling layer.
  Scarf pixels (`shankar0123.docker.scarf.sh/...`) — Scarf-account
    namespace, not a Go import or GitHub repo URL. Stays.

This is a non-blocking, non-customer-impacting change. Operators pulling
container images, running `make verify`, hitting the API, or installing the
agent see no functional difference. Only Go-tooling consumers (none today)
are affected, and they're enabled — not broken — by this commit.
v2.0.69
2026-05-04 00:30:29 +00:00
shankar0123 2d22e08a1e release: v2.0.68 — image registry path moved to ghcr.io/certctl-io
Image registry path changed. Starting this release, container images
publish to `ghcr.io/certctl-io/certctl-server` and
`ghcr.io/certctl-io/certctl-agent`. Existing pulls from
`ghcr.io/shankar0123/certctl-{server,agent}:<tag>` continue to work
for previously-published tags (the registry never deletes images),
but the `:latest` tag at the old path stops moving forward at this
release. Operators must update `docker pull` paths, `docker-compose.yml`
`image:` keys, or Helm `image.repository` values to receive future
updates. Old `git clone` / `git push` / install-script / API URLs
continue to redirect forever — only the container-registry path
changed.

This is the only operator-action-required change in v2.0.68. Other
changes since v2.0.67 are cosmetic URL refreshes after the GitHub
org transfer (shankar0123 → certctl-io, 2026-05-03) and a contextcheck
lint fix in the agent. The release.yml workflow's IMAGE_NAMESPACE env
var was swept to certctl-io as part of the URL refresh, so the next
release auto-pushes to the new ghcr.io path; verified via
`grep -n IMAGE_NAMESPACE .github/workflows/release.yml` showing
`IMAGE_NAMESPACE: certctl-io`.

Adds a top-of-file v2.0.68 entry to CHANGELOG.md as a one-time
migration callout. The existing "no hand-edited per-version changelog"
policy text is preserved below — that policy applies to per-version
entries; this is a one-time critical migration notice that needs to
be visible to operators doing diligence by reading CHANGELOG.md.
v2.0.68
2026-05-04 00:09:28 +00:00
shankar0123 cabe1aee45 docs(README): drop V3 Pro + V4 sections — everything ships free under BSL
Strategic pivot. We are NOT building a V3 Pro paid tier or a V4 cloud /
scale tier. Every certctl feature — current and future — ships free under
the same BSL 1.1 source-available license. No gated features, no paid
edition, no enterprise tier. Future revenue path is a managed-service
hosting offering: operator runs the certctl-server control plane as a
hosted service; customers self-install only the certctl-agent in their
infrastructure. The self-hosted code stays free forever; the managed
service sells operational convenience (no PostgreSQL to run, no upgrades,
no backups, no SSO setup). BSL 1.1 was already structured around exactly
this — the license expressly prevents competitors from running their own
commercial certctl-as-a-service against the same source while leaving
self-hosting unrestricted.

Removed the old roadmap sections:
- "### V3: certctl Pro" — Enterprise capabilities for larger deployments
  are available in the commercial tier.
- "### V4+: Cloud & Scale" — Kubernetes cert-manager external issuer,
  cloud infrastructure targets, extended CA support, and platform-scale
  features.

Replaced with a single "Forward-looking work — all free, all self-hostable"
section that names the real engineering tracks (OIDC / SSO / RBAC,
NATS / real-time, search / risk scoring, HSM / TPM / FIPS, deeper Vault
auth, cloud-managed-target deep integrations, adapter hardening, credential
lifecycle expansion) and points at the workspace-level WORKSPACE-ROADMAP.md
for the unshipped backlog. The full feature surface lands in V2 over time
— V3 / V4 are not real version targets, they were positioning artifacts.

Diff: 2 insertions / 5 deletions. README's License section (BSL 1.1
licensing-inquiries footer) is unchanged.
2026-05-04 00:00:23 +00:00
shankar0123 b577f6f251 fix(agent): thread ctx through createTargetConnector to satisfy contextcheck
CI run #428 (job 74148571711) failed on commit c8eb3e0 with:

  cmd/agent/main.go:690:44: Function `createTargetConnector` should pass
  the context parameter (contextcheck)

Pre-existing on master since the Rank 5 commits (8a56a78 Azure KV,
edf6bee AWS ACM) added two `case` branches in createTargetConnector
that called `awsacm.New(context.Background(), &cfg, a.logger)` and
`azurekv.New(context.Background(), &cfg, a.logger)` instead of
threading the caller's ctx. The contextcheck linter (in .golangci.yml)
flagged the call site at line 690 because the caller — the deploy
path inside processJob — has a `ctx` in scope (used a few lines later
for `a.reportJobStatus(ctx, ...)`).

Why CI fix #15 (c8eb3e0) didn't catch this: that commit was scoped
narrowly to fix go.mod / go.sum drift after Azure SDK transitive deps
shifted; it didn't run the full lint gate locally because the sandbox
disk-pressure path falls back to gofmt + go vet + go test -short, and
contextcheck is part of golangci-lint (not vet). It surfaced once CI
ran the full lint pipeline.

Fix:
- createTargetConnector signature: prepend `ctx context.Context` as the
  first parameter (matches the convention used everywhere else in the
  agent — heartbeat, processJob, reportJobStatus, etc.).
- Inside the function, replace both `context.Background()` calls
  (AWSACM + AzureKeyVault cases) with `ctx`. SDK credential resolution
  now honors caller cancellation / deadlines.
- Update the production call site at cmd/agent/main.go:690 to pass
  `ctx` (already in scope).
- Update the 6 test call sites in cmd/agent/agent_test.go to pass
  `context.Background()` (test functions don't have a ctx in scope —
  Background() is the conventional zero-value for unit tests).

Verified locally:
- gofmt: 0 lines diff
- go vet ./cmd/agent/...: exit 0
- go build ./cmd/agent/...: exit 0
- go test -short ./cmd/agent/...: ok 11.912s

The contextcheck linter itself wasn't re-run locally (golangci-lint
install needs ~300MB and the sandbox modcache + build cache already
filled disk). The fix matches the linter's diagnosis verbatim:
"should pass the context parameter" — call site now passes the
parameter; signature now accepts it.
2026-05-03 23:46:23 +00:00
shankar0123 0729ee46e0 chore: sweep github.com/shankar0123/certctl URL refs to certctl-io/certctl
Post-transfer cosmetic + release-critical URL refresh after moving the
repo from github.com/shankar0123/certctl to github.com/certctl-io/certctl
(2026-05-03). GitHub HTTP redirects continue to forward old URLs forever,
so existing operators are not broken — but aligns the canonical
references with the new owner so:

- procurement engineers / contributors browsing the docs see the right
  URL on first read
- operators copying the agent install one-liner hit the new path
  directly without going through a redirect
- the Helm chart's default image repository points at the canonical org
  registry path
- the OnboardingWizard rendered to first-run UI users shows the new
  URL in the install snippets and doc anchor links
- the GitHub Actions release workflow pushes container images to
  ghcr.io/certctl-io/certctl-{server,agent} (was: shankar0123)
- the release-notes Markdown body in release.yml — which gets stamped
  into every future release page — references the post-transfer
  cert-identity (cosign keyless signing now uses the certctl-io
  workflow URL) and the post-transfer SLSA provenance source-uri.
  Without this, every cosign verify / slsa-verifier command on a
  v2.1.0+ release would fail because the cert-identity-regexp would
  not match the signing identity GitHub Actions OIDC issues post-
  transfer. Old releases (v2.0.67 and earlier) keep their immutable
  release-notes pointing at the shankar0123 path and remain
  verifiable via their own published instructions.

Customer impact:
- Operators on ghcr.io/shankar0123/certctl-{server,agent}:latest
  silently freeze on whatever tag was current at transfer time. They
  get no errors; they just stop receiving updates. The next release
  notes need a one-line callout (Phase 3.1 of cowork/transfer-
  certctl-to-org.md) telling them to update their image path to
  ghcr.io/certctl-io/certctl-{server,agent}.
- All other URLs (git clone, install one-liner, raw.githubusercontent
  URLs, browser links, GitHub API) continue to resolve via permanent
  HTTP redirects. The sweep is cosmetic for those.

Files swept (30 total):
  .github/workflows/release.yml — IMAGE_NAMESPACE, source-uri,
    cosign cert-identity-regexp, IMAGE= snippet (5 refs total).
  CHANGELOG.md, README.md — anchor links, badges, install one-liner,
    cosign verify snippets in operator-facing sections.
  api/openapi.yaml — info / externalDocs URLs.
  install-agent.sh — GITHUB_REPO const + systemd unit Documentation=
    field.
  deploy/ENVIRONMENTS.md, deploy/helm/{CHART_SUMMARY,INDEX,
    INSTALLATION,README}.md, deploy/helm/certctl/{Chart.yaml,
    README.md,values.yaml}, deploy/helm/examples/values-*.yaml —
    chart docs + image repository defaults across dev / prod-ha
    overrides.
  docs/{certctl-for-cert-manager-users,connector-iis,connectors,
    migrate-from-acmesh,migrate-from-certbot,quickstart,test-env,
    why-certctl}.md — operator-facing doc URLs.
  examples/{acme-nginx,acme-wildcard-dns01,multi-issuer,
    private-ca-traefik,step-ca-haproxy}/docker-compose.yml +
    examples/step-ca-haproxy/step-ca-haproxy.md — example image:
    paths and accompanying narrative.
  web/src/pages/OnboardingWizard.tsx — first-run-UI URL refs (curl
    install one-liners, agent docker image path, doc anchor links).

Files intentionally NOT swept (Choice A from cowork/transfer-certctl-
to-org.md):
  go.mod, go.sum — module declaration stays github.com/shankar0123/
    certctl. Existing imports compile because Go uses the path
    declared in go.mod, not the URL it was fetched from. Internal-
    only project; no external Go consumers; rename will land as a
    mechanical sed when one materializes.
  ~250 *.go files — every import remains github.com/shankar0123/
    certctl/internal/...
  deploy/test/f5-mock-icontrol/go.mod — separate test sub-module;
    same Choice A logic; module path stays.

Files intentionally NOT swept (other reasons):
  README.md lines 244-245 — Scarf-pixel docker-pull commands.
    shankar0123.docker.scarf.sh/... is a Scarf-account hostname
    (per-user, not per-repo) and the pixel keeps tracking pulls
    against the operator's personal Scarf account. Migrating to a
    certctl-io Scarf account is a separate decision (create org
    Scarf account → re-create package → update README).
  deploy/test/f5-mock-icontrol/f5-mock-icontrol — checked-in
    compiled binary with shankar0123/certctl baked into Go build
    info via the sub-module path. Out of scope for a URL sweep;
    will refresh on the next `make test-integration` rebuild.

Verification:
  gofmt: clean (no .go files touched).
  go vet ./...: clean (verified at this SHA in 1.3 of the transfer
    checklist; no .go changes since).
  go build ./...: clean (same).
  go test -short on representative packages: green (same).
  Diff shape: 30 files, 74 insertions / 74 deletions, net-zero size,
    pure URL substitution.
2026-05-03 23:39:50 +00:00
shankar0123 c8eb3e0399 ci(go.mod): fix go mod tidy drift after Rank 5 cloud-target commits
CI failed at the "go mod tidy drift" gate on commit 9a7e818 (Rank 5
follow-up). The drift was leftover from the Azure SDK addition in
commit 8a56a78 — `go get` initially pulled the deprecated
`keyvault/azcertificates v0.9.0` path before I switched the import
to the supported `security/keyvault/azcertificates v1.4.0` path. The
v0.9.0 entries stayed in go.mod / go.sum as transitive `// indirect`
because the sandbox's `go mod tidy` couldn't run during the original
commit (disk-pressure on the modcache), so the cleanup got deferred
to CI's tidy-drift gate.

Aligning go.mod + go.sum with what `go mod tidy` produces on a clean
machine. Diff applied verbatim from the CI's `git diff --exit-code`
output:

  go.mod removed (// indirect):
    github.com/Azure/azure-sdk-for-go/sdk/keyvault/azcertificates v0.9.0
    github.com/Azure/azure-sdk-for-go/sdk/keyvault/internal v0.7.1
    github.com/kr/text v0.2.0  (no longer transitive after the
                                deprecated keyvault module is gone)

  go.sum removed:
    github.com/Azure/azure-sdk-for-go/sdk/keyvault/azcertificates v0.9.0 h1: + .mod
    github.com/Azure/azure-sdk-for-go/sdk/keyvault/internal v0.7.1 h1: + .mod
    github.com/creack/pty v1.1.9/go.mod
    github.com/kr/pretty v0.3.0 h1: + .mod
    github.com/rogpeppe/go-internal v1.8.1 h1: + .mod
    github.com/stretchr/testify v1.10.0 h1: + .mod

  go.sum added:
    github.com/Azure/azure-sdk-for-go/sdk/azidentity/cache v0.3.2 h1: + .mod
    github.com/AzureAD/microsoft-authentication-extensions-for-go/cache v0.1.1 h1: + .mod
    github.com/keybase/go-keychain v0.0.1 h1: + .mod
    github.com/kr/pretty v0.3.1 h1: + .mod
    github.com/rogpeppe/go-internal v1.12.0 h1: + .mod
    github.com/stretchr/testify v1.11.1 h1: + .mod

Net: 3 lines removed from go.mod, 21 lines net from go.sum (10
insertions / 14 deletions).

Verified locally:
- go build ./internal/connector/target/...  green.
- The h1: hashes copied verbatim from the CI's `go mod tidy`
  output line numbers in the run-#???? log so the operator can
  cross-reference the diff against what CI saw.
2026-05-03 23:01:08 +00:00
shankar0123 9a7e818f3e docs, seed: cloud-target operator runbook + AWS ACM / Azure KV demo seed rows
Wraps up Rank 5 of the 2026-05-03 Infisical deep-research deliverable
(commits edf6bee AWS + 8a56a78 Azure):

  - docs/runbook-cloud-targets.md — sysadmin-grade flowchart spanning
    the AWS ACM + Azure Key Vault deploy paths side-by-side. Covers
    minimum IAM policy / RBAC role JSON, IRSA + AKS workload-identity
    recipes, manual rollback recovery procedures (aws acm
    import-certificate / az keyvault certificate import), CloudTrail
    + Activity Log forensics queries for "who wrote to this ARN /
    vault cert", Prometheus cardinality + cost budget, and the
    V3-Pro forward path (CloudFront / Front Door direct-attach,
    ALB / App Gateway auto-bind, soft-delete recovery, GCP CM).
  - migrations/seed_demo.sql — two new demo target rows (tgt-aws-
    acm-prod + tgt-azure-kv-prod) so QA can exercise the per-cloud
    wiring end-to-end against the demo seed without standing up
    real cloud accounts.

cowork/WORKSPACE-ROADMAP.md (sibling-folder, not in this commit's
diff) was updated to mark the V2 AWS ACM + Azure KV connectors as
shipped and document the V3-Pro CloudFront / Front Door direct-attach
+ App Gateway auto-bind + soft-delete recovery + GCP CM follow-on
items.

cowork/infisical-deep-research-results.md (sibling-folder) Part 5
Rank 5 marked CLOSED with both commit SHAs.

Doc-only commit. No code changes.

Verified locally:
- go test -short -count=1 ./internal/connector/target/awsacm/...
  ./internal/connector/target/azurekv/...  green.
- markdown lint clean against the Bundle 8 + Rank 4 runbook templates.
2026-05-03 22:46:29 +00:00
shankar0123 8a56a78282 target(azurekv): SDK-driven Azure Key Vault target connector
Closes Rank 5 (Azure half) of the 2026-05-03 Infisical deep-research
deliverable (cowork/infisical-deep-research-results.md Part 5).
Pre-fix, certctl had no path to deploy certs to Azure-managed TLS-
termination endpoints (Application Gateway / Front Door / App Service
/ Container Apps) — operators terminating TLS at Azure had to use
manual `az keyvault certificate import` invocations or external
automation. This commit lands the SDK-driven Azure Key Vault target
connector that closes the gap, mirroring the AWS ACM target shape
shipped in commit edf6bee.

Architecture:
  - internal/connector/target/azurekv/azurekv.go — Connector wraps
    *azcertificates.Client behind the KeyVaultClient interface seam
    (mirrors awsacm's ACMClient + awsacmpca's ACMPCAClient). Lives
    in azurekv.go alongside the PFX (PKCS#12) wrapping helper that
    bundles the operator-supplied PEM cert + chain + key into the
    base64-PFX wire format azcertificates.ImportCertificate accepts.
  - internal/connector/target/azurekv/sdk_client.go — SDK-loading
    code isolated so the test path (NewWithClient) compiles without
    pulling azcore + azidentity transitive deps into the test
    binary. DefaultAzureCredential / ManagedIdentityCredential /
    EnvironmentCredential / WorkloadIdentityCredential selected via
    Config.CredentialMode (closed enum).
  - Pre-deploy snapshot via GetCertificate(name, "" /* latest */) so
    on-import-failure rollback restores the previous cert. Mirrors
    Bundle 5+. The Azure-specific quirk: rollback creates a NEW
    VERSION (Key Vault doesn't support version-restore without
    soft-delete recovery, which we keep off the minimum-RBAC
    surface). Operators reading audit dashboards see e.g. v1=initial,
    v2=failed-renewal, v3=rollback-of-v2; the certctl-managed-by +
    certctl-certificate-id provenance tags + future certctl-rollback-of
    metadata tag let an operator filter rollback artifacts.
  - Provenance tags identical to AWS ACM
    (certctl-managed-by=certctl + certctl-certificate-id=<mc-id>),
    automatically applied on every import. Key Vault carries tags
    forward across versions (unlike ACM which strips on re-import),
    so no separate AddTags call is required.
  - DeploymentRequest.KeyPEM held in agent memory only; PFX wrapping
    happens in-memory via software.sslmate.com/src/go-pkcs12. No
    disk write.

Tests:
  - azurekv_test.go: 13-subtest happy-path + validation matrix —
    ValidateConfig (success / missing-vault-url / malformed-vault-
    url / missing-cert-name / invalid-credential-mode / reserved-
    tag rejection), DeployCertificate (fresh import / rollback-on-
    serial-mismatch / empty-key-rejected / no-client-rejected /
    SDK-error-surfaced), ValidateOnly (returns sentinel),
    ValidateDeployment (serial match / mismatch).
  - All tests use the NewWithClient injection seam; no real-Azure
    API calls.
  - go test -short -count=1 ./internal/connector/target/azurekv/...
    green.

Wiring:
  - internal/domain/connector.go: TargetTypeAzureKeyVault =
    "AzureKeyVault".
  - internal/service/target.go: validTargetTypes set extended.
  - cmd/agent/main.go::createTargetConnector: AzureKeyVault case
    arm mirroring the AWSACM shape exactly.
  - cmd/agent/agent_test.go::TestCreateTargetConnector_AllSupported
    Types: AzureKeyVault added to the type matrix + the InvalidJSON
    matrix (16 supported target types now, up from 15).

go.mod / go.sum:
  - github.com/Azure/azure-sdk-for-go/sdk/azcore v1.20.0 (direct).
  - github.com/Azure/azure-sdk-for-go/sdk/azidentity v1.13.1 (direct).
  - github.com/Azure/azure-sdk-for-go/sdk/security/keyvault/
    azcertificates v1.4.0 (direct). The deprecated
    /keyvault/azcertificates path appears as a transitive indirect
    via Microsoft's microsoft-authentication-library-for-go; we use
    the new /security/keyvault/ path exclusively.

Documentation:
  - docs/connectors.md "Azure Key Vault" section: config table, RBAC
    role recipe (off-the-shelf "Key Vault Certificates Officer" or
    custom role with 3 data-plane actions), AKS workload-identity /
    managed-identity / service-principal / default credential
    recipes, atomic-rollback contract + Azure-version semantics
    explanation, soft-delete caveat, App Gateway / Front Door
    Terraform attachment snippet, threat model carve-outs (no disk
    writes, mandatory provenance tags, no long-lived secrets in
    Config), 5-bullet procurement checklist crib.

Out of scope (intentional, flagged in V3-Pro forward path):
  - Azure Front Door direct-attach (UpdateRoutingConfig — different
    Azure RBAC scope).
  - App Gateway / App Service auto-bind (V3-Pro auto-attach).
  - Soft-delete recovery (acm:RecoverDeletedCertificate-equivalent
    requires extra RBAC; V2 keeps minimum-permission surface).
  - GCP Certificate Manager (separate cloud, separate connector).

Verified locally:
- gofmt clean.
- go vet ./internal/connector/target/azurekv/...
  ./internal/domain/... ./internal/service/...
  ./cmd/agent/...  clean.
- go test -short -count=1 ./internal/connector/target/azurekv/...
  ./cmd/agent/...  green (all 16 supported target types
  instantiate via the agent factory).

Reference: cowork/infisical-deep-research-results.md Part 5 Rank 5.
Acquisition prompt:
cowork/rank-5-aws-acm-azure-kv-target-adapters-prompt.md.
Companion commit (AWS half): edf6bee.
2026-05-03 22:43:45 +00:00
shankar0123 edf6bee7f8 target(awsacm): SDK-driven AWS Certificate Manager target connector
Closes Rank 5 (AWS half) of the 2026-05-03 Infisical deep-research
deliverable (cowork/infisical-deep-research-results.md Part 5).
Pre-fix, certctl had no path to deploy certs to AWS-managed TLS-
termination endpoints (ALB / CloudFront / API Gateway / App Runner)
— operators terminating TLS at AWS had to use Infisical secret-sync,
manual aws-cli imports, or external automation. This commit lands
the SDK-driven AWS Certificate Manager target connector that closes
the gap end-to-end.

Architecture:
  - internal/connector/target/awsacm/awsacm.go — Connector wraps
    *acm.Client behind the ACMClient interface seam (mirrors
    awsacmpca's ACMPCAClient pattern from the issuer side).
    LoadDefaultConfig handles the standard AWS credential chain
    (IRSA / EC2 instance profile / SSO / env vars); no embedded
    creds in connector Config.
  - Pre-deploy snapshot via DescribeCertificate + GetCertificate so
    on-import-failure rollback restores the previous cert. Mirrors
    the Bundle 5 IIS pattern + the Bundle 7/8 WinCertStore /
    JavaKeystore patterns. Surfaces rollback success/failure via
    the existing certctl_deploy_rollback_total Prometheus counter
    label set.
  - Provenance tags: certctl-managed-by=certctl + certctl-
    certificate-id=<mc-id> set automatically on every import. ACM
    strips tags on re-import, so the connector calls
    AddTagsToCertificate post-import to keep the provenance pair
    fresh. Operators looking up a cert ARN by managed-cert ID
    (Terraform data source, CloudFormation output) match against
    these tags.
  - DeploymentRequest.KeyPEM held in agent memory only — never
    written to disk. Aligns with the pull-only deployment model
    documented in CLAUDE.md.

Tests:
  - awsacm_test.go: 15-subtest happy-path + validation matrix
    covering ValidateConfig (success / missing-region / malformed-
    region / malformed-ARN / reserved-tag rejection),
    DeployCertificate (fresh import / rotate-in-place / rollback-
    on-serial-mismatch / rollback-also-fails / empty-key-rejected /
    no-client-rejected), ValidateOnly (returns sentinel),
    ValidateDeployment (serial match / mismatch / no-ARN-yet).
  - awsacm_failure_test.go: 5 per-error-class contract tests
    mirroring the awsacmpca_failure_test.go shape (commit
    a2a59a8) — AccessDeniedException (smithy.GenericAPIError),
    ResourceNotFoundException (typed), ThrottlingException
    (smithy.GenericAPIError, FaultServer preserved),
    InvalidArgsException (typed, terminal), RequestInProgress
    Exception (typed). All assert errors.As against the SDK type +
    operator-actionable substring + connector-side wrap framing.
  - Coverage on awsacm.go: 54.9% of statements (matches the K8s-
    Secret + IIS connectors' 50-65% range; rollback-failure paths
    contribute most of the un-covered surface — those exercise
    only when the rollback's SDK call also returns an error).
  - go test -race -count=10 green; no goroutine leaks.

Wiring:
  - internal/domain/connector.go: TargetTypeAWSACM = "AWSACM".
  - internal/service/target.go: validTargetTypes set extended.
  - cmd/agent/main.go::createTargetConnector: AWSACM case arm
    mirroring the KubernetesSecrets shape exactly. Calls
    awsacm.New(context.Background(), &cfg, a.logger) — the
    SDK-loading happens here, not lazily, so config errors
    surface at agent boot.
  - cmd/agent/agent_test.go::TestCreateTargetConnector_AllSupported
    Types: AWSACM added to the type matrix + the InvalidJSON
    matrix.

go.mod / go.sum:
  - github.com/aws/aws-sdk-go-v2/service/acm v1.38.3 (direct).
    aws-sdk-go-v2 + service/acmpca + smithy-go were already direct
    from the awsacmpca issuer; this is the distribution-side
    companion package.

Documentation:
  - docs/connectors.md "AWS Certificate Manager (ACM)" section:
    config table, IAM policy JSON (5 actions on
    arn:aws:acm:*:*:certificate/*), IRSA / EC2 instance-profile /
    SSO auth recipes, atomic-rollback contract, Terraform ALB-
    attachment snippet, threat model carve-outs (no disk writes,
    mandatory provenance tags, no long-lived creds in Config),
    procurement checklist crib (5 bullets paste-able into a
    security review).

Out of scope (intentional, flagged in V3-Pro forward path):
  - CloudFront / ALB auto-attach (UpdateDistribution requires a
    different IAM scope than ACM ImportCertificate).
  - Cross-region ACM replication (ACM is regional; CloudFront
    forces us-east-1).
  - Tag-filtered ARN discovery (V2 uses operator-pinned
    Config.CertificateArn after first deploy; tag-scan path
    requires acm:ListTagsForCertificate which we deliberately
    keep off the minimum-IAM-policy surface).
  - Azure Key Vault (separate cloud, separate connector — Azure
    half of Rank 5 ships in a follow-on commit).

Verified locally:
- gofmt clean.
- go vet ./internal/connector/target/awsacm/...
  ./internal/domain/... ./internal/service/...
  ./cmd/agent/...  clean.
- go test -short -count=1 ./internal/connector/target/awsacm/...
  ./internal/domain/... ./cmd/agent/...  green (15 + 5 awsacm
  subtests; all 15 supported target types instantiate via the
  agent factory).
- go test -race -count=10 ./internal/connector/target/awsacm/...
  green.

Reference: cowork/infisical-deep-research-results.md Part 5 Rank 5.
Acquisition prompt:
cowork/rank-5-aws-acm-azure-kv-target-adapters-prompt.md.
2026-05-03 22:32:45 +00:00
shankar0123 109f32ff41 notifications: per-policy multi-channel expiry-alert routing
Closes Rank 4 of the 2026-05-03 Infisical deep-research deliverable
(see cowork/infisical-deep-research-results.md Part 5). Pre-fix,
RenewalService.CheckExpiringCertificates already ran daily,
RenewalPolicy.AlertThresholdsDays drove per-cert thresholds, and
NotificationService.SendThresholdAlert deduped per (cert, threshold)
— but the channel was hardcoded to Email
(internal/service/notification.go:118 pre-fix). Operators who
configured PagerDuty / Slack / Teams / OpsGenie via
CERTCTL_PAGERDUTY_ROUTING_KEY etc. got nothing at any threshold
unless SMTP was also wired. Their first signal of an expired cert
was a 3 AM outage.

This commit lands the routing matrix on top of the existing
infrastructure:

  1. RenewalPolicy gains AlertChannels (per-tier channel list) +
     AlertSeverityMap (per-threshold tier assignment) +
     EffectiveAlertChannels / EffectiveAlertSeverity accessors.
     Default*() helpers preserve the back-compat Email-only
     behaviour for operators who haven't touched their policies
     post-upgrade. Migration 000026 adds the JSONB columns
     idempotently.
  2. NotificationService.SendThresholdAlertOnChannel — the new
     per-channel dispatch helper. Old SendThresholdAlert stays as
     an Email-only alias so non-policy callers (admin "send test
     alert" surfaces) keep working byte-for-byte.
  3. NotificationService.HasThresholdNotificationOnChannel — per-
     (cert, threshold, channel) deduplication so a transient
     PagerDuty 5xx today does NOT suppress today's Slack alert and
     tomorrow's PagerDuty retry will still fire.
  4. RenewalService.sendThresholdAlerts walks the resolved channel
     set per threshold tier, fans out to every configured channel,
     handles per-channel failures independently, defensively drops
     off-enum channels with an audit row trail, and records a per-
     channel audit event with metadata.channel + metadata.severity_tier.
  5. service.ExpiryAlertMetrics — atomic counter table mirrored on
     the VaultRenewalMetrics shape from the 2026-05-03 audit fix #5
     (commit 0792271). Three labels: channel × threshold × result
     (success / failure / deduped). Cardinality bound: 6 × 4 × 3 =
     72 series for the standard 4-threshold matrix.
  6. handler.MetricsHandler.SetExpiryAlerts wires the Prometheus
     exposer for certctl_expiry_alerts_total{channel,threshold,result}.
     Pre-sorted snapshot for byte-stable emission.
  7. cmd/server/main.go threads ONE service.ExpiryAlertMetrics
     instance through both the recording side (notificationService.
     SetExpiryAlertMetrics) and the exposing side
     (metricsHandler.SetExpiryAlerts).

Dispatch flow (post-fix, per renewal-loop tick):

  cert ages past T-30  → daily renewal-loop fires
                       → policy lookup
                       → for each crossed threshold:
                           - resolve severity tier (informational/
                             warning/critical) via AlertSeverityMap
                           - look up channel set in AlertChannels[tier]
                           - for each channel: dedup → SendThresholdAlertOnChannel
                             → notifierRegistry[channel] → audit row →
                             Prometheus counter increment

Tests (internal/service/renewal_expiry_alerts_test.go):

  TestExpiryAlerts_DefaultMatrix_EmailOnly
  TestExpiryAlerts_PerTierFanOut
  TestExpiryAlerts_PerChannelDedup
  TestExpiryAlerts_OneChannelFails_OthersStillFire
  TestExpiryAlerts_OffEnumChannelDropped
  TestExpiryAlerts_MetricCounterIncrements
  TestExpiryAlerts_NilPolicy_FallsToDefault
  TestExpiryAlerts_OperatorOptOutOfTier

The PerTierFanOut test wires 6 mock notifiers, drives a cert at 0
days through the canonical 4 thresholds with the matrix
{informational:[Slack], warning:[Slack,Email],
critical:[PagerDuty,OpsGenie,Email]}, and asserts the exact
recipient counts: Slack=3, Email=3, PagerDuty=1, OpsGenie=1, no
Teams, no Webhook. The OneChannelFails test pins that PagerDuty
returning a 503 does NOT skip Slack/Email at the same threshold.

Drive-by fix (internal/service/testutil_test.go): the existing
mockNotifRepo.List ignored its filter and returned all rows, which
let legacy tests pass on dedup-via-substring even though the
postgres repo actually applied the filter. Updated the mock to
honour CertificateID / Type / Status / Channel / MessageLike
filters in the same shape as the postgres implementation
(internal/repository/postgres/notification.go). All pre-existing
service tests still pass — the legacy test suite happened to be
robust to the mock filter doing nothing.

Documentation:
  - docs/connectors.md Notifier section gains "Routing expiry
    alerts across channels" — operator-facing, JSON example,
    procurement playbook ("How do I make sure PagerDuty pages on
    the T-1 alert?"), debug recipe via SQL on audit_events +
    notification_events + Prometheus.
  - docs/runbook-expiry-alerts.md — sysadmin-grade flowchart,
    per-policy channel-matrix configuration recipes, "did the on-
    call team get paged?" SQL queries, cardinality budget, V3-Pro
    forward path.
  - cowork/WORKSPACE-ROADMAP.md gains "Multi-channel expiry
    alerts: per-owner routing" V3-Pro entry under Adapter
    hardening.

Out of scope (intentional, flagged in V3-Pro forward path):
  - Per-owner / per-team / per-tenant channel routing (matrix is
    per-policy today, not per-owner).
  - Calendar-aware suppression (no T-30 alerts on weekends).
  - Escalation chains (T-1 unanswered for 30m → escalate).
  - Per-channel rate limiting (downstream of I-005 retry+DLQ).

CHANGELOG.md is intentionally not hand-edited per CHANGELOG.md
itself ("no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/domain/... ./internal/service/...
  ./internal/api/handler/... ./cmd/server/...  clean.
  (./internal/repository/postgres/... vet failed on transitive
  testcontainers/docker module download — sandbox disk pressure,
  not a code issue; postgres-repo build succeeds and tests pass.)
- go test -short -count=1 ./internal/domain/...
  ./internal/service/... ./internal/api/handler/...  green.
- go test -race -count=10 -run 'TestExpiryAlerts'
  ./internal/service/...  green (per-channel dedup race-free).

Reference: cowork/infisical-deep-research-results.md Part 5 Rank 4.
Acquisition prompt: cowork/rank-4-multichannel-expiry-alerts-prompt.md.
2026-05-03 22:12:32 +00:00
shankar0123 022caf39b4 ci(googlecas): fix QF1002 staticcheck — tagged switch on r.URL.Path
CI failure on commit a2a59a8 (run #423):

  internal/connector/issuer/googlecas/googlecas_failure_test.go:189:3:
    QF1002: could use tagged switch on r.URL.Path (staticcheck)

The OAuth2 token-refresh test handler had two cases — `r.URL.Path ==
"/token"` and `default` — both equality-against-r.URL.Path. Stati-
ccheck's QF1002 rule wants this expressed as a tagged switch:

  switch r.URL.Path {
  case "/token":
      ...
  default:
      ...
  }

The other four switches in the same file are mixed equality + Contains
(`case r.URL.Path == "/token":` + `case strings.Contains(r.URL.Path,
"/certificates"):`) — those are not tag-able and stay on
`switch { case ... }`. Only the OAuth2 test handler had the single-
equality-case pattern QF1002 fires on.

Test-only commit. No production code change.

Verified locally:
- gofmt clean.
- go test -short -count=1 ./internal/connector/issuer/googlecas/...
  green (5 failure tests + 14 happy-path subtests + 4 stub tests).
2026-05-03 21:32:55 +00:00
shankar0123 869fc8f245 docs(openssl): operator playbook for shell-out threat model
Closes Top-10 fix #6 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix, the
OpenSSL adapter's docs in docs/connectors.md explained usage but
did NOT enumerate the threat model. The adapter exec's an arbitrary
operator-supplied script — env-var inheritance, symlink attacks,
sandbox-escape, multi-tenant process-isolation gaps. An acquirer's
security reviewer reading this surface cold pattern-matches
"highest-risk issuer surface with the lowest documented threat
model."

This commit lands a doc-side operator playbook in
docs/connectors.md OpenSSL section (mirrors Bundle 8's "Operator
playbook: keytool argv password exposure" subsection shape and
the 2026-05-02 audit Top-10 fix #7 SSH InsecureIgnoreHostKey
playbook). Six topics covered:

  1. Why the adapter exists despite the risk (CLI-driven CAs
     without Go SDKs need an integration path).
  2. Threat model the adapter accepts (trusted operator + trusted
     script + appropriate ownership + clear audit trail).
  3. Threat model the adapter does NOT accept (operator-writable
     script paths, untrusted content, multi-tenant hosts).
  4. Mitigations operators can layer (dedicated user, root-owned
     0755 binary, audit rules, per-call timeout via
     CERTCTL_OPENSSL_TIMEOUT_SECONDS, env sanitisation,
     chroot/container, audit wrapper, per-call concurrency
     bound).
  5. When NOT to use the adapter (compliance environments,
     multi-tenant servers, no-script-review environments).
  6. V3-Pro forward path (hardened mode tracked in
     cowork/WORKSPACE-ROADMAP.md).

Inline comment in internal/connector/issuer/openssl/openssl.go
near the callSignScript exec call site forward-references the
new doc subsection (no logic change).

cowork/WORKSPACE-ROADMAP.md gains an "OpenSSL hardened mode" V3-
Pro entry under "Adapter hardening" — sibling-folder doc, not in
the certctl repo, so not reflected in this commit's diff.

Same shape Bundle 8 used for the JavaKeystore playbook and the
2026-05-02 deployment-target audit Top-10 fix #7 used for the SSH
InsecureIgnoreHostKey playbook.

No code logic changes (only the explanatory comment near the
exec call site). No test changes. Doc-only commit.

Verified locally:
- gofmt / go vet clean.
- go test -short -count=1 ./internal/connector/issuer/openssl/...
  green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/RESULTS.md
Top-10 fix #6.
2026-05-03 21:28:05 +00:00
shankar0123 0792271dc6 vault: add automatic token renewal at TTL/2 + Prometheus metric
Closes Top-10 fix #5 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix, the
VaultPKI adapter authenticated with a static token and never called
renew-self. Long-lived deploys hit token expiry; the first
operator-visible signal was failed cert renewals on production
targets.

This commit:

  1. Connector.Start(ctx) spawns a goroutine that calls
     POST /v1/auth/token/renew-self at TTL/2 cadence (computed from a
     one-shot lookup-self at startup). Honours ctx.Done() for
     graceful shutdown via a per-loop done channel + Stop().
  2. On `renewable: false` response (initial lookup OR any subsequent
     renewal), the loop emits a WARN, increments the not_renewable
     counter, and exits. The operator must rotate the token before
     Vault's Max TTL elapses.
  3. New Prometheus counter certctl_vault_token_renewals_total with
     labels result={success,failure,not_renewable}. Registered
     alongside existing certctl_issuance_* counters in
     internal/api/handler/metrics.go.
  4. ERROR-level logging on renewal failure with operator-actionable
     substring ("vault token renewal failed; rotate the token before
     TTL expires") so journalctl + grep find it. Loop keeps ticking
     after a failure — transient blips don't kill it.

New optional issuer.Lifecycle interface:

  type Lifecycle interface {
      Start(ctx context.Context) error
      Stop()
  }

Connectors that hold no background goroutines (almost all of them)
do not implement this — IssuerRegistry.StartLifecycles /
StopLifecycles feature-detect via type assertion. New
lifecycle-bearing connectors plug in by implementing the interface;
no further registry plumbing required.

Wiring (cmd/server/main.go):

  - service.NewVaultRenewalMetrics() instance is shared between
    issuerRegistry.SetVaultRenewalMetrics (so Vault connectors built
    by Rebuild get a recorder) and metricsHandler.SetVaultRenewals
    (so the Prometheus exposer emits the new series).
  - issuerRegistry.StartLifecycles(ctx) is called after
    issuerService.BuildRegistry; defer issuerRegistry.StopLifecycles
    is paired so goroutines exit cleanly on signal.
  - IssuerConnectorAdapter.Underlying() exposes the wrapped
    issuer.Connector so registry-level machinery can reach the
    concrete connector behind the adapter without duplicating the
    wiring at every call site.

Tests (internal/connector/issuer/vault/vault_renew_test.go):

  - TestVault_RenewLoop_TickAtHalfTTL — three ticks → three
    renewals, all "success".
  - TestVault_RenewLoop_StopsOnNotRenewable — second renewal returns
    renewable=false, loop exits, third tick fires no HTTP call.
  - TestVault_RenewLoop_FailureSurfacesViaMetric — first renewal 403
    bumps "failure", second renewal succeeds → loop kept ticking.
  - TestVault_RenewLoop_CtxCancellation_StopsCleanly — Stop returns
    within 200ms after ctx cancel.
  - TestVault_RenewLoop_StartsNothingWhenNotRenewable — token
    already non-renewable at boot ⇒ no goroutine, "not_renewable"
    metric increments at startup so operators see it in Grafana.
  - TestVault_ComputeInterval — 4 cases pinning TTL/2 +
    minRenewInterval floor.
  - TestVault_RenewSelf_ParseFailure_NamesActionableInError —
    surfaced error contains "vault token renewal failed" + "rotate
    the token".

Cadence is dynamic — every successful renewal re-derives TTL/2
from the renewed lease's lease_duration, so a short bootstrap
token that gets renewed up to a longer Max TTL shifts to the
longer cadence automatically (defends against degenerate fast
ticking on a token whose Max TTL is far longer than its initial
TTL).

Documentation:
  - docs/connectors.md Vault PKI section gains "Token TTL +
    automatic renewal" subsection (operator-facing: cadence, metric,
    renewable=false rotation playbook).

Out of scope (intentional, flagged in the audit follow-up):
  - AppRole / Kubernetes / AWS IAM auth methods (different renewal
    semantics).
  - Hot-reload of rotated token from disk (operator restarts
    today; future: GUI/MCP issuer-update path triggers Rebuild
    which Stops the old connector and Starts the new one).
  - Auto-re-auth after token death (operator playbook owns it).

CHANGELOG.md is intentionally not hand-edited (per CHANGELOG.md
itself: "no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/service/... ./internal/api/handler/...
  ./internal/connector/issuer/vault/... ./cmd/server/...  clean.
- go test -short -count=1 ./internal/connector/issuer/vault/...
  ./internal/service/... ./internal/api/handler/...  green.
- go test -race -count=10 -run 'TestVault_RenewLoop|TestVault_ComputeInterval'
  ./internal/connector/issuer/vault/...  green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/RESULTS.md
Top-10 fix #5.
2026-05-03 21:24:27 +00:00
shankar0123 a2a59a823e googlecas, awsacmpca: add failure_test.go covering cloud-SDK error contracts
Closes Top-10 fix #4 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix, both
adapters had only happy-path test coverage with a single generic
ServerError pair each. Cloud CAs are typically the first-deployed
issuer in enterprise pilots; their diligence reviews dig hard into
IAM-error / cloud-error coverage. This commit lands the contract
tests.

AWSACMPCA — 5 tests in awsacmpca_failure_test.go. Each injects a
typed AWS SDK v2 error via the existing mockACMPCAClient seam and
asserts (1) error non-nil, (2) errors.As against the SDK's typed
value succeeds (so the wrap chain through fmt.Errorf("...%w", ...)
is intact), and (3) operator-actionable substring is present.

  1. Issue_AccessDenied — *smithy.GenericAPIError with
     Code="AccessDeniedException" (the SDK does NOT generate a
     typed *types.AccessDeniedException; AWS uses the smithy
     APIError shape for IAM denials). Asserts ErrorCode +
     "not authorized" + IAM resource path preserved through wrap.
  2. Issue_ResourceNotFound — *types.ResourceNotFoundException
     names the missing CA ARN.
  3. Issue_Throttling — *smithy.GenericAPIError with
     Code="ThrottlingException", Fault=FaultServer. Asserts the
     retryable class (FaultServer) is preserved through wrap so
     upstream retry logic can engage.
  4. Issue_MalformedCSR — *types.MalformedCSRException is terminal
     (operator must fix the CSR, not retry); asserts the
     validation-issue substring survives.
  5. Issue_RequestInProgress — *types.RequestInProgressException
     wraps cleanly; classification (retry vs reissue) is upstream's
     responsibility per the spec's "no new retry logic" rule.

GoogleCAS — 5 tests in googlecas_failure_test.go. The adapter uses
stdlib net/http directly (NO Google Cloud Go SDK dependency in
googlecas.go), so SDK typed-error assertions don't translate. Each
test runs an httptest.Server that returns the canonical Google API
JSON error envelope:

  {"error":{"code":N,"message":"...","status":"<STATUS>"}}

and asserts (1) error non-nil, (2) operator-actionable substring,
and (3) the canonical status string ("PERMISSION_DENIED",
"NOT_FOUND", "UNAVAILABLE") survives the wrap chain so upstream
classification can branch on it.

  1. Issue_PermissionDenied — 403 / PERMISSION_DENIED; surfaced
     error names the IAM resource path.
  2. Issue_CAPoolNotFound — 404 / NOT_FOUND; surfaced error names
     the missing pool resource.
  3. Issue_OAuth2TokenRefreshFailure — token endpoint returns 401
     invalid_grant; surfaced error mentions "token" so an operator
     reading the log immediately distinguishes a credential failure
     (rotate SA key) from a CA-side error (fix IAM binding). Test
     also asserts the CAS endpoint is NOT reached when the token
     exchange fails.
  4. Issue_RegionalAPIUnavailable — 503 / UNAVAILABLE; surfaced
     error preserves the retryable class markers (status code +
     UNAVAILABLE string) for upstream retry classification.
  5. Revoke_PermissionDenied — adapter does NOT silently swallow
     the failure; pin the contract so the audit-row atomicity
     guarantee from Bundle G (which lives in the service-layer
     wrapper, not the adapter) continues to apply. Test also
     verifies the revoke endpoint was actually reached, guarding
     against a future regression that short-circuits before the
     HTTP call.

Coverage delta:
  awsacmpca: 71.0% → 71.0% (failure tests reuse existing wrap
    code paths; behaviour-pin contract tests, not coverage tests).
  googlecas: 83.4% → 84.4% (+1.0pp).

go.mod: smithy-go moved indirect → direct, since the new AWSACMPCA
test file imports it. CI's go-mod-tidy-drift gate enforces this.

Test-only commit. No production code changes.

Verified locally:
  - gofmt clean.
  - go vet ./internal/connector/issuer/awsacmpca/...
    ./internal/connector/issuer/googlecas/...  clean.
  - go test -short -count=1 ./internal/connector/issuer/...  green.
  - go test -race -count=10 ./internal/connector/issuer/awsacmpca
    ./internal/connector/issuer/googlecas  green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/RESULTS.md
Top-10 fix #4.
2026-05-03 21:10:41 +00:00
shankar0123 b0c4ed1ae2 openssl: add failure_test.go covering 6 shell-out error modes
Closes Top-10 fix #3 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix, the
OpenSSL adapter (497 LOC, certctl's highest-risk issuer surface)
had openssl_test.go (8 happy-path funcs + 20 subtests) but no
dedicated _failure_test.go. Compare to ACME, Vault, DigiCert,
Sectigo, Entrust, GlobalSign, EJBCA — all peers have one. An
acquirer's diligence team flags this as an immediate blocker on
the highest-risk issuer surface.

This commit adds 6 failure-mode tests:

  1. TestOpenSSL_Issue_ScriptNotFound_OperatorActionableError —
     SignScript path doesn't exist; error wraps os.ErrNotExist
     (errors.Is); message contains 'no such file' / 'not found'
     so the operator's grep finds it in journalctl.
  2. TestOpenSSL_Issue_PermissionDenied_OperatorActionableError —
     SignScript exists with mode 0o600 (non-executable); error
     wraps os.ErrPermission; message contains 'permission'.
     Skipped under root (uid 0 bypasses chmod gating).
  3. TestOpenSSL_Issue_MalformedStdout_DistinguishedFromCSRReject
     — script exits 0 + writes garbage (no PEM markers) to the
     cert output file; error mentions PEM/certificate/parse so
     operators distinguish output-parsing failure from a script-
     side fault.
  4. TestOpenSSL_Issue_NonZeroExit_DistinguishesCAReject_From_
     ScriptError — script writes 'policy violation: …' to stderr
     and exits 2 (CA-side rejection convention); the script's
     stderr surfaces in the error message; errors.Unwrap returns
     non-nil (proving the underlying *exec.ExitError chain
     survives).
  5. TestOpenSSL_Issue_TimeoutEnforced_ContextCancellationPropagates
     — script does 'exec sleep 30' (not 'sleep 30 ' as a child;
     exec replaces bash so SIGKILL goes directly to the sleeper,
     avoiding the orphan-pipes corner case where a killed bash
     leaves sleep holding stdout/stderr open and CombinedOutput
     blocks); ctx with 100ms deadline; call returns within ~5s
     wall-clock; either errors.Is(err, context.DeadlineExceeded)
     or the error message names 'killed' / 'signal'.
  6. TestOpenSSL_Issue_SignalKilled_PartialOutputDiscarded —
     script writes a half-PEM ('-----BEGIN CERTIFICATE-----\nMII…')
     then 'kill -KILL $$'; assertion: result is nil OR
     CertPEM is empty (no half-cert leaks to caller); error
     names 'signal' / 'killed' OR 'PEM' / 'parse' (both are
     operator-actionable).

Each test pins the operator-actionable error message contract:
the message names the failure mode (so journalctl + grep find
it) and proves no half-state was created (no partial cert
returned). errors.Is / errors.Unwrap checks confirm the wrapping
chain survives.

The OpenSSL adapter has no commandRunner abstraction (production
code uses exec.CommandContext directly); these tests use real
operator-supplied scripts written to t.TempDir (matches the
adapter's actual production code path; no os/exec mocking). The
'exec sleep 30' technique in Test 5 is the load-bearing fix for
the bash-orphans-sleep-and-pipes-stay-open corner case that
otherwise makes the test take 30s instead of 100ms.

Coverage delta:
  - Before this commit: openssl_test.go + openssl_stubs_test.go
    covered 8 happy-path funcs.
  - After: 79.8% statement coverage of openssl.go (up from
    operator-pre-existing baseline; the 6 new tests exercise
    every error path through callSignScript + parseCertificate).

Tests pass clean under '-race -count=10' (Test 5's deadline
tolerance is the only timing-sensitive case; the 5s wall-clock
budget vs the 100ms ctx deadline gives ample slack on slow CI
without masking deadline-not-enforced bugs).

Test-only commit; no production code changes. Hardening fixes
(per-call concurrency semaphore, threat-model docs) are separate
Top-10 entries.

Verified locally:
  - gofmt clean across the repo.
  - go vet ./... clean across the repo.
  - go test -race -count=10 -short
    ./internal/connector/issuer/openssl/... green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/
RESULTS.md Top-10 fix #3.
2026-05-03 20:55:44 +00:00
shankar0123 d3bf2cc0cf vault, digicert: migrate Token / APIKey to *secret.Ref (Bundle I Phase 3)
Closes Top-10 fix #2 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix,
vault.Config.Token and digicert.Config.APIKey were plain string
fields. Practical impact:

  1. GET /api/v1/issuers responses marshalled the credential into
     the JSON body. An acquirer's procurement engineer running
     'curl /api/v1/issuers | jq' saw the token / API key in plain
     text on screen.
  2. DEBUG-level HTTP request logging printed the credential
     header verbatim.
  3. A heap dump of the running server contained the credential
     as readable bytes for the lifetime of the process.

Bundle I from the 2026-05-01 audit closed this for AWSACMPCA,
EJBCA, GlobalSign, Sectigo (Phase 1+2). Vault and DigiCert were
left out. This commit ports the same migration onto them.

Mechanics:
  - Config.Token / Config.APIKey type changed from 'string' to
    '*secret.Ref'. UnmarshalJSON of a JSON string populates the
    Ref via NewRefFromString — operator config files are
    unchanged.
  - Every header-write call site routed through Ref.Use, with the
    byte buffer zeroed after the callback returns. Vault: 3 sites
    (IssueCertificate, RevokeCertificate, GetCACertPEM). DigiCert:
    5 sites (ValidateConfig, IssueCertificate, RevokeCertificate,
    pollOrderOnce, downloadCertificate).
  - ValidateConfig nil-checks switch from 'cfg.Token == ""' to
    'cfg.Token.IsEmpty()' (mirrors Sectigo's existing pattern).
  - Tests migrated: every Config{Token:"..."} →
    Config{Token: secret.NewRefFromString("...")}. The
    'json.Marshal(config) → ValidateConfig(rawConfig)' round-trip
    pattern in DigiCert's ValidateConfig_Success test is now
    broken by the redact-on-marshal contract — switched that one
    to construct the rawConfig as a JSON literal (mirrors
    Sectigo's existing test pattern).
  - Two new tests pin the redact-on-marshal contract:
      - TestVault_Config_TokenMarshalsAsRedacted (vault_redact_test.go)
      - TestDigiCert_Config_APIKeyMarshalsAsRedacted (digicert_redact_test.go)
    Both assert the marshaled JSON contains '"[redacted]"' and
    does NOT contain the plaintext bytes.

Operator-visible: GET /api/v1/issuers responses for type=vault
and type=digicert now show the credential as '[redacted]'.
Existing config files keep working — the Ref unmarshal accepts
strings.

CHANGELOG note: certctl/CHANGELOG.md is intentionally not
hand-edited; release notes are auto-generated from commit
messages between consecutive tags. This commit's message body is
the release-note artifact.

Verified locally:
  - gofmt clean across the repo.
  - go vet ./... clean across the repo.
  - go test -race -count=1 -short
    ./internal/connector/issuer/vault/...
    ./internal/connector/issuer/digicert/...
    ./internal/secret/...  green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/
RESULTS.md Top-10 fix #2.
2026-05-03 20:49:23 +00:00
shankar0123 81f6321326 ejbca: port mTLS keypair to mtlscache (close Bundle M for the last issuer)
Closes Top-10 fix #1 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix,
ejbca.go::New called tls.LoadX509KeyPair once at construction and
configured the keypair into *http.Transport.TLSClientConfig with
no mtime watch. mTLS rotation required a server restart — quarterly
rotation per any reasonable security policy = quarterly deploy
outage.

Bundle M from the prior 2026-05-01 audit shipped the mtlscache
helper at internal/connector/issuer/mtlscache/cache.go and wired
it into Entrust + GlobalSign. EJBCA was missed in Bundle M's
scope. This commit ports the same helper onto EJBCA's
auth_mode=mtls path. The OAuth2 path is unchanged.

Implementation:
  - New imports internal/connector/issuer/mtlscache.
  - Connector struct gains an mtls *mtlscache.Cache field
    (mirroring Entrust + GlobalSign).
  - New()'s case 'mtls': replaces tls.LoadX509KeyPair + manual
    *http.Transport with mtlscache.New(certPath, keyPath,
    Options{HTTPTimeout: 30s}). Cache build happens at construction
    so misconfigured operators fail fast (matches pre-fix
    behaviour).
  - New helper getHTTPClient() returns the cached client; on the
    mTLS path it calls RefreshIfStale before returning so the
    next request uses the new keypair if disk has rotated. On
    OAuth2 / test paths (c.mtls == nil), returns c.httpClient
    as-is.
  - All 3 c.httpClient.Do call sites (IssueCertificate enroll,
    RevokeCertificate revoke, GetOrderStatus cert lookup) replaced
    with c.getHTTPClient() + client.Do.
  - crypto/tls import removed (no longer used at this layer).

Tests:
  - TestEJBCA_MTLSKeypairRotation_PicksUpNewCertWithoutRestart
    (new, ejbca_mtls_rotation_test.go): generates two CAs (caA,
    caB), signs leafA + leafB, spins up an httptest TLS server
    that trusts both CAs and records the issuer DN of every
    presented client cert, writes leafA, makes request 1, writes
    leafB + advances mtime by 2s, makes request 2. Asserts the
    server saw caA's DN on req 1 and caB's DN on req 2 — the
    cache picked up the rotation without ejbca.New re-running.
  - export_test.go: GetHTTPClientForTest helper exposes the
    private getHTTPClient so the rotation test drives the
    production code path.
  - All existing EJBCA tests still pass (TestNew_MTLSWiresClientCert,
    TestNew_MTLSCertLoadFailure, TestNew_OAuth2NoTransportTuning,
    TestNew_InvalidAuthMode).

Verified locally:
  - gofmt clean across the repo.
  - go vet ./... clean across the repo.
  - go test -race -count=1 -short ./internal/connector/issuer/ejbca/...
    ./internal/connector/issuer/mtlscache/... green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/
RESULTS.md Top-10 fix #1.
2026-05-03 20:38:19 +00:00
shankar0123 39f065dda4 docs(acme-server): operator-facing reference + threat model + cert-manager walkthrough (Phase 6/7)
Doc-only commit closing the ACME-server work series. After this commit,
an outside reviewer (procurement engineer / Venafi diligence engineer /
Infisical-comparison-shopper) can read the docs cold, understand the
ACME server's surface, follow the cert-manager walkthrough, and reach
a deployment decision without escalating to certctl maintainers.

What ships:
  - docs/acme-server.md final pass: Auth-mode decision tree (when to
    use trust_authenticated vs challenge), RFC 8555 + RFC 9773
    conformance statement (section-by-section table of implemented
    plus procurement-honest 'not implemented' rows for EAB / multi-
    level wildcards / RFC 8738 / cross-CA proxying), Troubleshooting
    (5 failure modes — badNonce / unknownAuthority / HTTP-01
    connection refused / DNS-01 NXDOMAIN / rejectedIdentifier with
    canonical fix for each), Version pinning + tested clients table
    (cert-manager 1.15.0, lego v4, kind v0.20+, Caddy 2.7.x, Traefik
    3.0+), FAQ (5 entries — why two auth modes, vs cert-manager-
    against-LE, can-I-use-from-outside-K8s, migration story, audit-
    log catalog), See-also cross-link block.
  - docs/acme-cert-manager-walkthrough.md: kind → cert-manager →
    certctl → Certificate flow, with YAML blocks byte-equal to
    deploy/test/acme-integration/{clusterissuer-trust-authenticated,
    certificate-test}.yaml to prevent doc/test drift.
  - docs/acme-caddy-walkthrough.md: Caddyfile acme_ca + tls.cas
    options (OS trust store + Caddy pki.ca block).
  - docs/acme-traefik-walkthrough.md: certificatesResolvers.<name>.acme
    .caServer + serversTransport.rootCAs configuration.
  - docs/acme-server-threat-model.md: Threat surface map + JWS forgery
    resistance (alg-confusion / HS256 substitution / replayed nonce /
    URL spoofing / multi-sig / kid-vs-jwk / kid round-trip mismatch),
    Nonce store integrity rationale, HTTP-01 SSRF defense-in-depth
    (pre-dial check + per-dial check + per-redirect check + body cap +
    bounded redirects), DNS-01 cache-poisoning posture (default Google
    Public DNS + operator-owns-private-resolver-posture), TLS-ALPN-01
    chain-not-validated rationale (RFC 8737 §3 explicit), Rate-limit
    tuning, Audit trail catalog, Out-of-scope threats list.
  - docs/connectors.md: TOC renumbered 3→4 etc. to make room for new
    top-level 'ACME Server (Built-in)' section between Issuer Connector
    and Target Connector — distinguishes the consumer-side ACME
    (existing) from the new server-side ACME via env-var-prefix
    call-out (CERTCTL_ACME_* vs CERTCTL_ACME_SERVER_*).

DoD verification:
  - All 5 docs files exist with the structure prescribed by the
    Phase 6 prompt.
  - Every CERTCTL_ACME_SERVER_* env var in docs/acme-server.md maps
    to an actual lookup in internal/config/config.go (verified by
    'grep -oE | sort -u | diff' returning empty).
  - Every YAML snippet in docs/acme-cert-manager-walkthrough.md is
    byte-equal to the corresponding file in deploy/test/acme-integration/
    (verified with 'diff' against awk-extracted YAML blocks).
  - docs/connectors.md has the cross-link subsection with all 4 new
    docs referenced.
  - cowork/CLAUDE.md Architecture Decisions has the new ACME-server
    bullet documenting per-profile URL family + per-profile
    acme_auth_mode + Phase 4-5-6 progression.
  - cowork/WORKSPACE-CHANGELOG.md has the ACME-Server-6 entry plus
    the ACME-Server rollup spanning Phases 1a-6.
  - cowork/infisical-deep-research-results.md Rank 1 marked SHIPPED.
  - 'gofmt -l .' clean (no Go changes); 'go vet ./...' clean.

Acquisition-readiness: every one of the 12 acquisition-grade criteria
from cowork/acme-server-endpoint-prompt.md is verified by the test
suite (Phases 1a-5) plus this doc walkthrough (Phase 6). The full
RFC 8555 + RFC 9773 surface is live; the operator can deploy
end-to-end by reading one walkthrough doc and one env-var table.

Engineering history: cowork/WORKSPACE-CHANGELOG.md 'ACME-Server-6 (docs)'
+ ACME-Server rollup of all 6 phases.
2026-05-03 19:58:15 +00:00