Commit Graph

300 Commits

Author SHA1 Message Date
shankar0123 8b75e0311b chore: rename Go module path to github.com/certctl-io/certctl
Mechanical sed across the main go.mod's module declaration, the f5-mock-icontrol
sub-module's go.mod, every Go file's import path (361 files), and a rebuild of
the checked-in f5-mock-icontrol binary so its embedded build-info reflects the
new module path. No behavior change.

Choice B from cowork/transfer-certctl-to-org.md, executed 2026-05-04. Choice A
(keep module path declared as github.com/shankar0123/certctl regardless of
repo URL) shipped on the day of the org transfer (2026-05-03) since we had no
external Go consumers; this commit closes that deferral.

Backward-compat: GitHub HTTP redirects continue to forward
github.com/shankar0123/certctl → github.com/certctl-io/certctl at the URL
level, but Go's module proxy uses the path declared in go.mod as the
canonical name. Pre-fix, anyone trying `go get github.com/certctl-io/certctl/...`
hit a "module path mismatch" error because go.mod said
github.com/shankar0123/certctl and the URL they fetched it from said
certctl-io/certctl. Post-fix, the canonical name and the URL agree, so
go get / go install / external Go consumers / Go-tooling integrations
work cleanly via either the new path (preferred) or the old path (which
redirects and Go follows the redirect for source fetch).

Anyone still importing the old path inside their own code keeps working
provided they update their go.mod's `require` line to match — the module
path declared in their consumer's go.sum / go.mod is the authoritative
import name, so a mass sed across their import statements is the migration
on the consumer side. No external consumers exist today.

Diff shape:
  361 *.go files  — import path replacement only
    2 go.mod     — module declaration replacement only
    1 binary     — deploy/test/f5-mock-icontrol/f5-mock-icontrol rebuilt
                   so embedded build-info reflects the new path (8618965 vs
                   8618933 bytes; 32-byte diff is the build-info change)

  Total: 364 files, 730 insertions / 730 deletions, net-zero size, pure
  mechanical substitution.

Verification:
  gofmt: 17 files needed re-alignment after sed (the new path is one char
    shorter than the old, so column-aligned import groups drifted). Applied
    `gofmt -w` to fix.
  go mod tidy: clean exit on both modules.
  go vet ./...: clean exit.
  go build ./...: clean exit.
  go test -short -count=1 on representative packages: all green
    (internal/domain, internal/validation, internal/crypto, internal/crypto/signer,
    cmd/agent). Test output now reads `ok github.com/certctl-io/certctl/...`
    confirming the module path resolves correctly.
  binary: f5-mock-icontrol rebuilt; `strings | grep shankar0123` returns
    nothing; `strings | grep certctl-io/certctl` shows the new module path
    embedded in build-info.

Files intentionally NOT touched in this commit:
  README.md / CHANGELOG.md / docs/ / etc. — already swept to certctl-io
    URLs in commit 0729ee4 (the post-transfer URL refresh). This commit is
    purely the Go-tooling layer.
  Scarf pixels (`shankar0123.docker.scarf.sh/...`) — Scarf-account
    namespace, not a Go import or GitHub repo URL. Stays.

This is a non-blocking, non-customer-impacting change. Operators pulling
container images, running `make verify`, hitting the API, or installing the
agent see no functional difference. Only Go-tooling consumers (none today)
are affected, and they're enabled — not broken — by this commit.
2026-05-04 00:30:29 +00:00
shankar0123 8a56a78282 target(azurekv): SDK-driven Azure Key Vault target connector
Closes Rank 5 (Azure half) of the 2026-05-03 Infisical deep-research
deliverable (cowork/infisical-deep-research-results.md Part 5).
Pre-fix, certctl had no path to deploy certs to Azure-managed TLS-
termination endpoints (Application Gateway / Front Door / App Service
/ Container Apps) — operators terminating TLS at Azure had to use
manual `az keyvault certificate import` invocations or external
automation. This commit lands the SDK-driven Azure Key Vault target
connector that closes the gap, mirroring the AWS ACM target shape
shipped in commit edf6bee.

Architecture:
  - internal/connector/target/azurekv/azurekv.go — Connector wraps
    *azcertificates.Client behind the KeyVaultClient interface seam
    (mirrors awsacm's ACMClient + awsacmpca's ACMPCAClient). Lives
    in azurekv.go alongside the PFX (PKCS#12) wrapping helper that
    bundles the operator-supplied PEM cert + chain + key into the
    base64-PFX wire format azcertificates.ImportCertificate accepts.
  - internal/connector/target/azurekv/sdk_client.go — SDK-loading
    code isolated so the test path (NewWithClient) compiles without
    pulling azcore + azidentity transitive deps into the test
    binary. DefaultAzureCredential / ManagedIdentityCredential /
    EnvironmentCredential / WorkloadIdentityCredential selected via
    Config.CredentialMode (closed enum).
  - Pre-deploy snapshot via GetCertificate(name, "" /* latest */) so
    on-import-failure rollback restores the previous cert. Mirrors
    Bundle 5+. The Azure-specific quirk: rollback creates a NEW
    VERSION (Key Vault doesn't support version-restore without
    soft-delete recovery, which we keep off the minimum-RBAC
    surface). Operators reading audit dashboards see e.g. v1=initial,
    v2=failed-renewal, v3=rollback-of-v2; the certctl-managed-by +
    certctl-certificate-id provenance tags + future certctl-rollback-of
    metadata tag let an operator filter rollback artifacts.
  - Provenance tags identical to AWS ACM
    (certctl-managed-by=certctl + certctl-certificate-id=<mc-id>),
    automatically applied on every import. Key Vault carries tags
    forward across versions (unlike ACM which strips on re-import),
    so no separate AddTags call is required.
  - DeploymentRequest.KeyPEM held in agent memory only; PFX wrapping
    happens in-memory via software.sslmate.com/src/go-pkcs12. No
    disk write.

Tests:
  - azurekv_test.go: 13-subtest happy-path + validation matrix —
    ValidateConfig (success / missing-vault-url / malformed-vault-
    url / missing-cert-name / invalid-credential-mode / reserved-
    tag rejection), DeployCertificate (fresh import / rollback-on-
    serial-mismatch / empty-key-rejected / no-client-rejected /
    SDK-error-surfaced), ValidateOnly (returns sentinel),
    ValidateDeployment (serial match / mismatch).
  - All tests use the NewWithClient injection seam; no real-Azure
    API calls.
  - go test -short -count=1 ./internal/connector/target/azurekv/...
    green.

Wiring:
  - internal/domain/connector.go: TargetTypeAzureKeyVault =
    "AzureKeyVault".
  - internal/service/target.go: validTargetTypes set extended.
  - cmd/agent/main.go::createTargetConnector: AzureKeyVault case
    arm mirroring the AWSACM shape exactly.
  - cmd/agent/agent_test.go::TestCreateTargetConnector_AllSupported
    Types: AzureKeyVault added to the type matrix + the InvalidJSON
    matrix (16 supported target types now, up from 15).

go.mod / go.sum:
  - github.com/Azure/azure-sdk-for-go/sdk/azcore v1.20.0 (direct).
  - github.com/Azure/azure-sdk-for-go/sdk/azidentity v1.13.1 (direct).
  - github.com/Azure/azure-sdk-for-go/sdk/security/keyvault/
    azcertificates v1.4.0 (direct). The deprecated
    /keyvault/azcertificates path appears as a transitive indirect
    via Microsoft's microsoft-authentication-library-for-go; we use
    the new /security/keyvault/ path exclusively.

Documentation:
  - docs/connectors.md "Azure Key Vault" section: config table, RBAC
    role recipe (off-the-shelf "Key Vault Certificates Officer" or
    custom role with 3 data-plane actions), AKS workload-identity /
    managed-identity / service-principal / default credential
    recipes, atomic-rollback contract + Azure-version semantics
    explanation, soft-delete caveat, App Gateway / Front Door
    Terraform attachment snippet, threat model carve-outs (no disk
    writes, mandatory provenance tags, no long-lived secrets in
    Config), 5-bullet procurement checklist crib.

Out of scope (intentional, flagged in V3-Pro forward path):
  - Azure Front Door direct-attach (UpdateRoutingConfig — different
    Azure RBAC scope).
  - App Gateway / App Service auto-bind (V3-Pro auto-attach).
  - Soft-delete recovery (acm:RecoverDeletedCertificate-equivalent
    requires extra RBAC; V2 keeps minimum-permission surface).
  - GCP Certificate Manager (separate cloud, separate connector).

Verified locally:
- gofmt clean.
- go vet ./internal/connector/target/azurekv/...
  ./internal/domain/... ./internal/service/...
  ./cmd/agent/...  clean.
- go test -short -count=1 ./internal/connector/target/azurekv/...
  ./cmd/agent/...  green (all 16 supported target types
  instantiate via the agent factory).

Reference: cowork/infisical-deep-research-results.md Part 5 Rank 5.
Acquisition prompt:
cowork/rank-5-aws-acm-azure-kv-target-adapters-prompt.md.
Companion commit (AWS half): edf6bee.
2026-05-03 22:43:45 +00:00
shankar0123 edf6bee7f8 target(awsacm): SDK-driven AWS Certificate Manager target connector
Closes Rank 5 (AWS half) of the 2026-05-03 Infisical deep-research
deliverable (cowork/infisical-deep-research-results.md Part 5).
Pre-fix, certctl had no path to deploy certs to AWS-managed TLS-
termination endpoints (ALB / CloudFront / API Gateway / App Runner)
— operators terminating TLS at AWS had to use Infisical secret-sync,
manual aws-cli imports, or external automation. This commit lands
the SDK-driven AWS Certificate Manager target connector that closes
the gap end-to-end.

Architecture:
  - internal/connector/target/awsacm/awsacm.go — Connector wraps
    *acm.Client behind the ACMClient interface seam (mirrors
    awsacmpca's ACMPCAClient pattern from the issuer side).
    LoadDefaultConfig handles the standard AWS credential chain
    (IRSA / EC2 instance profile / SSO / env vars); no embedded
    creds in connector Config.
  - Pre-deploy snapshot via DescribeCertificate + GetCertificate so
    on-import-failure rollback restores the previous cert. Mirrors
    the Bundle 5 IIS pattern + the Bundle 7/8 WinCertStore /
    JavaKeystore patterns. Surfaces rollback success/failure via
    the existing certctl_deploy_rollback_total Prometheus counter
    label set.
  - Provenance tags: certctl-managed-by=certctl + certctl-
    certificate-id=<mc-id> set automatically on every import. ACM
    strips tags on re-import, so the connector calls
    AddTagsToCertificate post-import to keep the provenance pair
    fresh. Operators looking up a cert ARN by managed-cert ID
    (Terraform data source, CloudFormation output) match against
    these tags.
  - DeploymentRequest.KeyPEM held in agent memory only — never
    written to disk. Aligns with the pull-only deployment model
    documented in CLAUDE.md.

Tests:
  - awsacm_test.go: 15-subtest happy-path + validation matrix
    covering ValidateConfig (success / missing-region / malformed-
    region / malformed-ARN / reserved-tag rejection),
    DeployCertificate (fresh import / rotate-in-place / rollback-
    on-serial-mismatch / rollback-also-fails / empty-key-rejected /
    no-client-rejected), ValidateOnly (returns sentinel),
    ValidateDeployment (serial match / mismatch / no-ARN-yet).
  - awsacm_failure_test.go: 5 per-error-class contract tests
    mirroring the awsacmpca_failure_test.go shape (commit
    a2a59a8) — AccessDeniedException (smithy.GenericAPIError),
    ResourceNotFoundException (typed), ThrottlingException
    (smithy.GenericAPIError, FaultServer preserved),
    InvalidArgsException (typed, terminal), RequestInProgress
    Exception (typed). All assert errors.As against the SDK type +
    operator-actionable substring + connector-side wrap framing.
  - Coverage on awsacm.go: 54.9% of statements (matches the K8s-
    Secret + IIS connectors' 50-65% range; rollback-failure paths
    contribute most of the un-covered surface — those exercise
    only when the rollback's SDK call also returns an error).
  - go test -race -count=10 green; no goroutine leaks.

Wiring:
  - internal/domain/connector.go: TargetTypeAWSACM = "AWSACM".
  - internal/service/target.go: validTargetTypes set extended.
  - cmd/agent/main.go::createTargetConnector: AWSACM case arm
    mirroring the KubernetesSecrets shape exactly. Calls
    awsacm.New(context.Background(), &cfg, a.logger) — the
    SDK-loading happens here, not lazily, so config errors
    surface at agent boot.
  - cmd/agent/agent_test.go::TestCreateTargetConnector_AllSupported
    Types: AWSACM added to the type matrix + the InvalidJSON
    matrix.

go.mod / go.sum:
  - github.com/aws/aws-sdk-go-v2/service/acm v1.38.3 (direct).
    aws-sdk-go-v2 + service/acmpca + smithy-go were already direct
    from the awsacmpca issuer; this is the distribution-side
    companion package.

Documentation:
  - docs/connectors.md "AWS Certificate Manager (ACM)" section:
    config table, IAM policy JSON (5 actions on
    arn:aws:acm:*:*:certificate/*), IRSA / EC2 instance-profile /
    SSO auth recipes, atomic-rollback contract, Terraform ALB-
    attachment snippet, threat model carve-outs (no disk writes,
    mandatory provenance tags, no long-lived creds in Config),
    procurement checklist crib (5 bullets paste-able into a
    security review).

Out of scope (intentional, flagged in V3-Pro forward path):
  - CloudFront / ALB auto-attach (UpdateDistribution requires a
    different IAM scope than ACM ImportCertificate).
  - Cross-region ACM replication (ACM is regional; CloudFront
    forces us-east-1).
  - Tag-filtered ARN discovery (V2 uses operator-pinned
    Config.CertificateArn after first deploy; tag-scan path
    requires acm:ListTagsForCertificate which we deliberately
    keep off the minimum-IAM-policy surface).
  - Azure Key Vault (separate cloud, separate connector — Azure
    half of Rank 5 ships in a follow-on commit).

Verified locally:
- gofmt clean.
- go vet ./internal/connector/target/awsacm/...
  ./internal/domain/... ./internal/service/...
  ./cmd/agent/...  clean.
- go test -short -count=1 ./internal/connector/target/awsacm/...
  ./internal/domain/... ./cmd/agent/...  green (15 + 5 awsacm
  subtests; all 15 supported target types instantiate via the
  agent factory).
- go test -race -count=10 ./internal/connector/target/awsacm/...
  green.

Reference: cowork/infisical-deep-research-results.md Part 5 Rank 5.
Acquisition prompt:
cowork/rank-5-aws-acm-azure-kv-target-adapters-prompt.md.
2026-05-03 22:32:45 +00:00
shankar0123 109f32ff41 notifications: per-policy multi-channel expiry-alert routing
Closes Rank 4 of the 2026-05-03 Infisical deep-research deliverable
(see cowork/infisical-deep-research-results.md Part 5). Pre-fix,
RenewalService.CheckExpiringCertificates already ran daily,
RenewalPolicy.AlertThresholdsDays drove per-cert thresholds, and
NotificationService.SendThresholdAlert deduped per (cert, threshold)
— but the channel was hardcoded to Email
(internal/service/notification.go:118 pre-fix). Operators who
configured PagerDuty / Slack / Teams / OpsGenie via
CERTCTL_PAGERDUTY_ROUTING_KEY etc. got nothing at any threshold
unless SMTP was also wired. Their first signal of an expired cert
was a 3 AM outage.

This commit lands the routing matrix on top of the existing
infrastructure:

  1. RenewalPolicy gains AlertChannels (per-tier channel list) +
     AlertSeverityMap (per-threshold tier assignment) +
     EffectiveAlertChannels / EffectiveAlertSeverity accessors.
     Default*() helpers preserve the back-compat Email-only
     behaviour for operators who haven't touched their policies
     post-upgrade. Migration 000026 adds the JSONB columns
     idempotently.
  2. NotificationService.SendThresholdAlertOnChannel — the new
     per-channel dispatch helper. Old SendThresholdAlert stays as
     an Email-only alias so non-policy callers (admin "send test
     alert" surfaces) keep working byte-for-byte.
  3. NotificationService.HasThresholdNotificationOnChannel — per-
     (cert, threshold, channel) deduplication so a transient
     PagerDuty 5xx today does NOT suppress today's Slack alert and
     tomorrow's PagerDuty retry will still fire.
  4. RenewalService.sendThresholdAlerts walks the resolved channel
     set per threshold tier, fans out to every configured channel,
     handles per-channel failures independently, defensively drops
     off-enum channels with an audit row trail, and records a per-
     channel audit event with metadata.channel + metadata.severity_tier.
  5. service.ExpiryAlertMetrics — atomic counter table mirrored on
     the VaultRenewalMetrics shape from the 2026-05-03 audit fix #5
     (commit 0792271). Three labels: channel × threshold × result
     (success / failure / deduped). Cardinality bound: 6 × 4 × 3 =
     72 series for the standard 4-threshold matrix.
  6. handler.MetricsHandler.SetExpiryAlerts wires the Prometheus
     exposer for certctl_expiry_alerts_total{channel,threshold,result}.
     Pre-sorted snapshot for byte-stable emission.
  7. cmd/server/main.go threads ONE service.ExpiryAlertMetrics
     instance through both the recording side (notificationService.
     SetExpiryAlertMetrics) and the exposing side
     (metricsHandler.SetExpiryAlerts).

Dispatch flow (post-fix, per renewal-loop tick):

  cert ages past T-30  → daily renewal-loop fires
                       → policy lookup
                       → for each crossed threshold:
                           - resolve severity tier (informational/
                             warning/critical) via AlertSeverityMap
                           - look up channel set in AlertChannels[tier]
                           - for each channel: dedup → SendThresholdAlertOnChannel
                             → notifierRegistry[channel] → audit row →
                             Prometheus counter increment

Tests (internal/service/renewal_expiry_alerts_test.go):

  TestExpiryAlerts_DefaultMatrix_EmailOnly
  TestExpiryAlerts_PerTierFanOut
  TestExpiryAlerts_PerChannelDedup
  TestExpiryAlerts_OneChannelFails_OthersStillFire
  TestExpiryAlerts_OffEnumChannelDropped
  TestExpiryAlerts_MetricCounterIncrements
  TestExpiryAlerts_NilPolicy_FallsToDefault
  TestExpiryAlerts_OperatorOptOutOfTier

The PerTierFanOut test wires 6 mock notifiers, drives a cert at 0
days through the canonical 4 thresholds with the matrix
{informational:[Slack], warning:[Slack,Email],
critical:[PagerDuty,OpsGenie,Email]}, and asserts the exact
recipient counts: Slack=3, Email=3, PagerDuty=1, OpsGenie=1, no
Teams, no Webhook. The OneChannelFails test pins that PagerDuty
returning a 503 does NOT skip Slack/Email at the same threshold.

Drive-by fix (internal/service/testutil_test.go): the existing
mockNotifRepo.List ignored its filter and returned all rows, which
let legacy tests pass on dedup-via-substring even though the
postgres repo actually applied the filter. Updated the mock to
honour CertificateID / Type / Status / Channel / MessageLike
filters in the same shape as the postgres implementation
(internal/repository/postgres/notification.go). All pre-existing
service tests still pass — the legacy test suite happened to be
robust to the mock filter doing nothing.

Documentation:
  - docs/connectors.md Notifier section gains "Routing expiry
    alerts across channels" — operator-facing, JSON example,
    procurement playbook ("How do I make sure PagerDuty pages on
    the T-1 alert?"), debug recipe via SQL on audit_events +
    notification_events + Prometheus.
  - docs/runbook-expiry-alerts.md — sysadmin-grade flowchart,
    per-policy channel-matrix configuration recipes, "did the on-
    call team get paged?" SQL queries, cardinality budget, V3-Pro
    forward path.
  - cowork/WORKSPACE-ROADMAP.md gains "Multi-channel expiry
    alerts: per-owner routing" V3-Pro entry under Adapter
    hardening.

Out of scope (intentional, flagged in V3-Pro forward path):
  - Per-owner / per-team / per-tenant channel routing (matrix is
    per-policy today, not per-owner).
  - Calendar-aware suppression (no T-30 alerts on weekends).
  - Escalation chains (T-1 unanswered for 30m → escalate).
  - Per-channel rate limiting (downstream of I-005 retry+DLQ).

CHANGELOG.md is intentionally not hand-edited per CHANGELOG.md
itself ("no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/domain/... ./internal/service/...
  ./internal/api/handler/... ./cmd/server/...  clean.
  (./internal/repository/postgres/... vet failed on transitive
  testcontainers/docker module download — sandbox disk pressure,
  not a code issue; postgres-repo build succeeds and tests pass.)
- go test -short -count=1 ./internal/domain/...
  ./internal/service/... ./internal/api/handler/...  green.
- go test -race -count=10 -run 'TestExpiryAlerts'
  ./internal/service/...  green (per-channel dedup race-free).

Reference: cowork/infisical-deep-research-results.md Part 5 Rank 4.
Acquisition prompt: cowork/rank-4-multichannel-expiry-alerts-prompt.md.
2026-05-03 22:12:32 +00:00
shankar0123 022caf39b4 ci(googlecas): fix QF1002 staticcheck — tagged switch on r.URL.Path
CI failure on commit a2a59a8 (run #423):

  internal/connector/issuer/googlecas/googlecas_failure_test.go:189:3:
    QF1002: could use tagged switch on r.URL.Path (staticcheck)

The OAuth2 token-refresh test handler had two cases — `r.URL.Path ==
"/token"` and `default` — both equality-against-r.URL.Path. Stati-
ccheck's QF1002 rule wants this expressed as a tagged switch:

  switch r.URL.Path {
  case "/token":
      ...
  default:
      ...
  }

The other four switches in the same file are mixed equality + Contains
(`case r.URL.Path == "/token":` + `case strings.Contains(r.URL.Path,
"/certificates"):`) — those are not tag-able and stay on
`switch { case ... }`. Only the OAuth2 test handler had the single-
equality-case pattern QF1002 fires on.

Test-only commit. No production code change.

Verified locally:
- gofmt clean.
- go test -short -count=1 ./internal/connector/issuer/googlecas/...
  green (5 failure tests + 14 happy-path subtests + 4 stub tests).
2026-05-03 21:32:55 +00:00
shankar0123 869fc8f245 docs(openssl): operator playbook for shell-out threat model
Closes Top-10 fix #6 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix, the
OpenSSL adapter's docs in docs/connectors.md explained usage but
did NOT enumerate the threat model. The adapter exec's an arbitrary
operator-supplied script — env-var inheritance, symlink attacks,
sandbox-escape, multi-tenant process-isolation gaps. An acquirer's
security reviewer reading this surface cold pattern-matches
"highest-risk issuer surface with the lowest documented threat
model."

This commit lands a doc-side operator playbook in
docs/connectors.md OpenSSL section (mirrors Bundle 8's "Operator
playbook: keytool argv password exposure" subsection shape and
the 2026-05-02 audit Top-10 fix #7 SSH InsecureIgnoreHostKey
playbook). Six topics covered:

  1. Why the adapter exists despite the risk (CLI-driven CAs
     without Go SDKs need an integration path).
  2. Threat model the adapter accepts (trusted operator + trusted
     script + appropriate ownership + clear audit trail).
  3. Threat model the adapter does NOT accept (operator-writable
     script paths, untrusted content, multi-tenant hosts).
  4. Mitigations operators can layer (dedicated user, root-owned
     0755 binary, audit rules, per-call timeout via
     CERTCTL_OPENSSL_TIMEOUT_SECONDS, env sanitisation,
     chroot/container, audit wrapper, per-call concurrency
     bound).
  5. When NOT to use the adapter (compliance environments,
     multi-tenant servers, no-script-review environments).
  6. V3-Pro forward path (hardened mode tracked in
     cowork/WORKSPACE-ROADMAP.md).

Inline comment in internal/connector/issuer/openssl/openssl.go
near the callSignScript exec call site forward-references the
new doc subsection (no logic change).

cowork/WORKSPACE-ROADMAP.md gains an "OpenSSL hardened mode" V3-
Pro entry under "Adapter hardening" — sibling-folder doc, not in
the certctl repo, so not reflected in this commit's diff.

Same shape Bundle 8 used for the JavaKeystore playbook and the
2026-05-02 deployment-target audit Top-10 fix #7 used for the SSH
InsecureIgnoreHostKey playbook.

No code logic changes (only the explanatory comment near the
exec call site). No test changes. Doc-only commit.

Verified locally:
- gofmt / go vet clean.
- go test -short -count=1 ./internal/connector/issuer/openssl/...
  green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/RESULTS.md
Top-10 fix #6.
2026-05-03 21:28:05 +00:00
shankar0123 0792271dc6 vault: add automatic token renewal at TTL/2 + Prometheus metric
Closes Top-10 fix #5 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix, the
VaultPKI adapter authenticated with a static token and never called
renew-self. Long-lived deploys hit token expiry; the first
operator-visible signal was failed cert renewals on production
targets.

This commit:

  1. Connector.Start(ctx) spawns a goroutine that calls
     POST /v1/auth/token/renew-self at TTL/2 cadence (computed from a
     one-shot lookup-self at startup). Honours ctx.Done() for
     graceful shutdown via a per-loop done channel + Stop().
  2. On `renewable: false` response (initial lookup OR any subsequent
     renewal), the loop emits a WARN, increments the not_renewable
     counter, and exits. The operator must rotate the token before
     Vault's Max TTL elapses.
  3. New Prometheus counter certctl_vault_token_renewals_total with
     labels result={success,failure,not_renewable}. Registered
     alongside existing certctl_issuance_* counters in
     internal/api/handler/metrics.go.
  4. ERROR-level logging on renewal failure with operator-actionable
     substring ("vault token renewal failed; rotate the token before
     TTL expires") so journalctl + grep find it. Loop keeps ticking
     after a failure — transient blips don't kill it.

New optional issuer.Lifecycle interface:

  type Lifecycle interface {
      Start(ctx context.Context) error
      Stop()
  }

Connectors that hold no background goroutines (almost all of them)
do not implement this — IssuerRegistry.StartLifecycles /
StopLifecycles feature-detect via type assertion. New
lifecycle-bearing connectors plug in by implementing the interface;
no further registry plumbing required.

Wiring (cmd/server/main.go):

  - service.NewVaultRenewalMetrics() instance is shared between
    issuerRegistry.SetVaultRenewalMetrics (so Vault connectors built
    by Rebuild get a recorder) and metricsHandler.SetVaultRenewals
    (so the Prometheus exposer emits the new series).
  - issuerRegistry.StartLifecycles(ctx) is called after
    issuerService.BuildRegistry; defer issuerRegistry.StopLifecycles
    is paired so goroutines exit cleanly on signal.
  - IssuerConnectorAdapter.Underlying() exposes the wrapped
    issuer.Connector so registry-level machinery can reach the
    concrete connector behind the adapter without duplicating the
    wiring at every call site.

Tests (internal/connector/issuer/vault/vault_renew_test.go):

  - TestVault_RenewLoop_TickAtHalfTTL — three ticks → three
    renewals, all "success".
  - TestVault_RenewLoop_StopsOnNotRenewable — second renewal returns
    renewable=false, loop exits, third tick fires no HTTP call.
  - TestVault_RenewLoop_FailureSurfacesViaMetric — first renewal 403
    bumps "failure", second renewal succeeds → loop kept ticking.
  - TestVault_RenewLoop_CtxCancellation_StopsCleanly — Stop returns
    within 200ms after ctx cancel.
  - TestVault_RenewLoop_StartsNothingWhenNotRenewable — token
    already non-renewable at boot ⇒ no goroutine, "not_renewable"
    metric increments at startup so operators see it in Grafana.
  - TestVault_ComputeInterval — 4 cases pinning TTL/2 +
    minRenewInterval floor.
  - TestVault_RenewSelf_ParseFailure_NamesActionableInError —
    surfaced error contains "vault token renewal failed" + "rotate
    the token".

Cadence is dynamic — every successful renewal re-derives TTL/2
from the renewed lease's lease_duration, so a short bootstrap
token that gets renewed up to a longer Max TTL shifts to the
longer cadence automatically (defends against degenerate fast
ticking on a token whose Max TTL is far longer than its initial
TTL).

Documentation:
  - docs/connectors.md Vault PKI section gains "Token TTL +
    automatic renewal" subsection (operator-facing: cadence, metric,
    renewable=false rotation playbook).

Out of scope (intentional, flagged in the audit follow-up):
  - AppRole / Kubernetes / AWS IAM auth methods (different renewal
    semantics).
  - Hot-reload of rotated token from disk (operator restarts
    today; future: GUI/MCP issuer-update path triggers Rebuild
    which Stops the old connector and Starts the new one).
  - Auto-re-auth after token death (operator playbook owns it).

CHANGELOG.md is intentionally not hand-edited (per CHANGELOG.md
itself: "no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/service/... ./internal/api/handler/...
  ./internal/connector/issuer/vault/... ./cmd/server/...  clean.
- go test -short -count=1 ./internal/connector/issuer/vault/...
  ./internal/service/... ./internal/api/handler/...  green.
- go test -race -count=10 -run 'TestVault_RenewLoop|TestVault_ComputeInterval'
  ./internal/connector/issuer/vault/...  green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/RESULTS.md
Top-10 fix #5.
2026-05-03 21:24:27 +00:00
shankar0123 a2a59a823e googlecas, awsacmpca: add failure_test.go covering cloud-SDK error contracts
Closes Top-10 fix #4 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix, both
adapters had only happy-path test coverage with a single generic
ServerError pair each. Cloud CAs are typically the first-deployed
issuer in enterprise pilots; their diligence reviews dig hard into
IAM-error / cloud-error coverage. This commit lands the contract
tests.

AWSACMPCA — 5 tests in awsacmpca_failure_test.go. Each injects a
typed AWS SDK v2 error via the existing mockACMPCAClient seam and
asserts (1) error non-nil, (2) errors.As against the SDK's typed
value succeeds (so the wrap chain through fmt.Errorf("...%w", ...)
is intact), and (3) operator-actionable substring is present.

  1. Issue_AccessDenied — *smithy.GenericAPIError with
     Code="AccessDeniedException" (the SDK does NOT generate a
     typed *types.AccessDeniedException; AWS uses the smithy
     APIError shape for IAM denials). Asserts ErrorCode +
     "not authorized" + IAM resource path preserved through wrap.
  2. Issue_ResourceNotFound — *types.ResourceNotFoundException
     names the missing CA ARN.
  3. Issue_Throttling — *smithy.GenericAPIError with
     Code="ThrottlingException", Fault=FaultServer. Asserts the
     retryable class (FaultServer) is preserved through wrap so
     upstream retry logic can engage.
  4. Issue_MalformedCSR — *types.MalformedCSRException is terminal
     (operator must fix the CSR, not retry); asserts the
     validation-issue substring survives.
  5. Issue_RequestInProgress — *types.RequestInProgressException
     wraps cleanly; classification (retry vs reissue) is upstream's
     responsibility per the spec's "no new retry logic" rule.

GoogleCAS — 5 tests in googlecas_failure_test.go. The adapter uses
stdlib net/http directly (NO Google Cloud Go SDK dependency in
googlecas.go), so SDK typed-error assertions don't translate. Each
test runs an httptest.Server that returns the canonical Google API
JSON error envelope:

  {"error":{"code":N,"message":"...","status":"<STATUS>"}}

and asserts (1) error non-nil, (2) operator-actionable substring,
and (3) the canonical status string ("PERMISSION_DENIED",
"NOT_FOUND", "UNAVAILABLE") survives the wrap chain so upstream
classification can branch on it.

  1. Issue_PermissionDenied — 403 / PERMISSION_DENIED; surfaced
     error names the IAM resource path.
  2. Issue_CAPoolNotFound — 404 / NOT_FOUND; surfaced error names
     the missing pool resource.
  3. Issue_OAuth2TokenRefreshFailure — token endpoint returns 401
     invalid_grant; surfaced error mentions "token" so an operator
     reading the log immediately distinguishes a credential failure
     (rotate SA key) from a CA-side error (fix IAM binding). Test
     also asserts the CAS endpoint is NOT reached when the token
     exchange fails.
  4. Issue_RegionalAPIUnavailable — 503 / UNAVAILABLE; surfaced
     error preserves the retryable class markers (status code +
     UNAVAILABLE string) for upstream retry classification.
  5. Revoke_PermissionDenied — adapter does NOT silently swallow
     the failure; pin the contract so the audit-row atomicity
     guarantee from Bundle G (which lives in the service-layer
     wrapper, not the adapter) continues to apply. Test also
     verifies the revoke endpoint was actually reached, guarding
     against a future regression that short-circuits before the
     HTTP call.

Coverage delta:
  awsacmpca: 71.0% → 71.0% (failure tests reuse existing wrap
    code paths; behaviour-pin contract tests, not coverage tests).
  googlecas: 83.4% → 84.4% (+1.0pp).

go.mod: smithy-go moved indirect → direct, since the new AWSACMPCA
test file imports it. CI's go-mod-tidy-drift gate enforces this.

Test-only commit. No production code changes.

Verified locally:
  - gofmt clean.
  - go vet ./internal/connector/issuer/awsacmpca/...
    ./internal/connector/issuer/googlecas/...  clean.
  - go test -short -count=1 ./internal/connector/issuer/...  green.
  - go test -race -count=10 ./internal/connector/issuer/awsacmpca
    ./internal/connector/issuer/googlecas  green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/RESULTS.md
Top-10 fix #4.
2026-05-03 21:10:41 +00:00
shankar0123 b0c4ed1ae2 openssl: add failure_test.go covering 6 shell-out error modes
Closes Top-10 fix #3 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix, the
OpenSSL adapter (497 LOC, certctl's highest-risk issuer surface)
had openssl_test.go (8 happy-path funcs + 20 subtests) but no
dedicated _failure_test.go. Compare to ACME, Vault, DigiCert,
Sectigo, Entrust, GlobalSign, EJBCA — all peers have one. An
acquirer's diligence team flags this as an immediate blocker on
the highest-risk issuer surface.

This commit adds 6 failure-mode tests:

  1. TestOpenSSL_Issue_ScriptNotFound_OperatorActionableError —
     SignScript path doesn't exist; error wraps os.ErrNotExist
     (errors.Is); message contains 'no such file' / 'not found'
     so the operator's grep finds it in journalctl.
  2. TestOpenSSL_Issue_PermissionDenied_OperatorActionableError —
     SignScript exists with mode 0o600 (non-executable); error
     wraps os.ErrPermission; message contains 'permission'.
     Skipped under root (uid 0 bypasses chmod gating).
  3. TestOpenSSL_Issue_MalformedStdout_DistinguishedFromCSRReject
     — script exits 0 + writes garbage (no PEM markers) to the
     cert output file; error mentions PEM/certificate/parse so
     operators distinguish output-parsing failure from a script-
     side fault.
  4. TestOpenSSL_Issue_NonZeroExit_DistinguishesCAReject_From_
     ScriptError — script writes 'policy violation: …' to stderr
     and exits 2 (CA-side rejection convention); the script's
     stderr surfaces in the error message; errors.Unwrap returns
     non-nil (proving the underlying *exec.ExitError chain
     survives).
  5. TestOpenSSL_Issue_TimeoutEnforced_ContextCancellationPropagates
     — script does 'exec sleep 30' (not 'sleep 30 ' as a child;
     exec replaces bash so SIGKILL goes directly to the sleeper,
     avoiding the orphan-pipes corner case where a killed bash
     leaves sleep holding stdout/stderr open and CombinedOutput
     blocks); ctx with 100ms deadline; call returns within ~5s
     wall-clock; either errors.Is(err, context.DeadlineExceeded)
     or the error message names 'killed' / 'signal'.
  6. TestOpenSSL_Issue_SignalKilled_PartialOutputDiscarded —
     script writes a half-PEM ('-----BEGIN CERTIFICATE-----\nMII…')
     then 'kill -KILL $$'; assertion: result is nil OR
     CertPEM is empty (no half-cert leaks to caller); error
     names 'signal' / 'killed' OR 'PEM' / 'parse' (both are
     operator-actionable).

Each test pins the operator-actionable error message contract:
the message names the failure mode (so journalctl + grep find
it) and proves no half-state was created (no partial cert
returned). errors.Is / errors.Unwrap checks confirm the wrapping
chain survives.

The OpenSSL adapter has no commandRunner abstraction (production
code uses exec.CommandContext directly); these tests use real
operator-supplied scripts written to t.TempDir (matches the
adapter's actual production code path; no os/exec mocking). The
'exec sleep 30' technique in Test 5 is the load-bearing fix for
the bash-orphans-sleep-and-pipes-stay-open corner case that
otherwise makes the test take 30s instead of 100ms.

Coverage delta:
  - Before this commit: openssl_test.go + openssl_stubs_test.go
    covered 8 happy-path funcs.
  - After: 79.8% statement coverage of openssl.go (up from
    operator-pre-existing baseline; the 6 new tests exercise
    every error path through callSignScript + parseCertificate).

Tests pass clean under '-race -count=10' (Test 5's deadline
tolerance is the only timing-sensitive case; the 5s wall-clock
budget vs the 100ms ctx deadline gives ample slack on slow CI
without masking deadline-not-enforced bugs).

Test-only commit; no production code changes. Hardening fixes
(per-call concurrency semaphore, threat-model docs) are separate
Top-10 entries.

Verified locally:
  - gofmt clean across the repo.
  - go vet ./... clean across the repo.
  - go test -race -count=10 -short
    ./internal/connector/issuer/openssl/... green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/
RESULTS.md Top-10 fix #3.
2026-05-03 20:55:44 +00:00
shankar0123 d3bf2cc0cf vault, digicert: migrate Token / APIKey to *secret.Ref (Bundle I Phase 3)
Closes Top-10 fix #2 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix,
vault.Config.Token and digicert.Config.APIKey were plain string
fields. Practical impact:

  1. GET /api/v1/issuers responses marshalled the credential into
     the JSON body. An acquirer's procurement engineer running
     'curl /api/v1/issuers | jq' saw the token / API key in plain
     text on screen.
  2. DEBUG-level HTTP request logging printed the credential
     header verbatim.
  3. A heap dump of the running server contained the credential
     as readable bytes for the lifetime of the process.

Bundle I from the 2026-05-01 audit closed this for AWSACMPCA,
EJBCA, GlobalSign, Sectigo (Phase 1+2). Vault and DigiCert were
left out. This commit ports the same migration onto them.

Mechanics:
  - Config.Token / Config.APIKey type changed from 'string' to
    '*secret.Ref'. UnmarshalJSON of a JSON string populates the
    Ref via NewRefFromString — operator config files are
    unchanged.
  - Every header-write call site routed through Ref.Use, with the
    byte buffer zeroed after the callback returns. Vault: 3 sites
    (IssueCertificate, RevokeCertificate, GetCACertPEM). DigiCert:
    5 sites (ValidateConfig, IssueCertificate, RevokeCertificate,
    pollOrderOnce, downloadCertificate).
  - ValidateConfig nil-checks switch from 'cfg.Token == ""' to
    'cfg.Token.IsEmpty()' (mirrors Sectigo's existing pattern).
  - Tests migrated: every Config{Token:"..."} →
    Config{Token: secret.NewRefFromString("...")}. The
    'json.Marshal(config) → ValidateConfig(rawConfig)' round-trip
    pattern in DigiCert's ValidateConfig_Success test is now
    broken by the redact-on-marshal contract — switched that one
    to construct the rawConfig as a JSON literal (mirrors
    Sectigo's existing test pattern).
  - Two new tests pin the redact-on-marshal contract:
      - TestVault_Config_TokenMarshalsAsRedacted (vault_redact_test.go)
      - TestDigiCert_Config_APIKeyMarshalsAsRedacted (digicert_redact_test.go)
    Both assert the marshaled JSON contains '"[redacted]"' and
    does NOT contain the plaintext bytes.

Operator-visible: GET /api/v1/issuers responses for type=vault
and type=digicert now show the credential as '[redacted]'.
Existing config files keep working — the Ref unmarshal accepts
strings.

CHANGELOG note: certctl/CHANGELOG.md is intentionally not
hand-edited; release notes are auto-generated from commit
messages between consecutive tags. This commit's message body is
the release-note artifact.

Verified locally:
  - gofmt clean across the repo.
  - go vet ./... clean across the repo.
  - go test -race -count=1 -short
    ./internal/connector/issuer/vault/...
    ./internal/connector/issuer/digicert/...
    ./internal/secret/...  green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/
RESULTS.md Top-10 fix #2.
2026-05-03 20:49:23 +00:00
shankar0123 81f6321326 ejbca: port mTLS keypair to mtlscache (close Bundle M for the last issuer)
Closes Top-10 fix #1 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix,
ejbca.go::New called tls.LoadX509KeyPair once at construction and
configured the keypair into *http.Transport.TLSClientConfig with
no mtime watch. mTLS rotation required a server restart — quarterly
rotation per any reasonable security policy = quarterly deploy
outage.

Bundle M from the prior 2026-05-01 audit shipped the mtlscache
helper at internal/connector/issuer/mtlscache/cache.go and wired
it into Entrust + GlobalSign. EJBCA was missed in Bundle M's
scope. This commit ports the same helper onto EJBCA's
auth_mode=mtls path. The OAuth2 path is unchanged.

Implementation:
  - New imports internal/connector/issuer/mtlscache.
  - Connector struct gains an mtls *mtlscache.Cache field
    (mirroring Entrust + GlobalSign).
  - New()'s case 'mtls': replaces tls.LoadX509KeyPair + manual
    *http.Transport with mtlscache.New(certPath, keyPath,
    Options{HTTPTimeout: 30s}). Cache build happens at construction
    so misconfigured operators fail fast (matches pre-fix
    behaviour).
  - New helper getHTTPClient() returns the cached client; on the
    mTLS path it calls RefreshIfStale before returning so the
    next request uses the new keypair if disk has rotated. On
    OAuth2 / test paths (c.mtls == nil), returns c.httpClient
    as-is.
  - All 3 c.httpClient.Do call sites (IssueCertificate enroll,
    RevokeCertificate revoke, GetOrderStatus cert lookup) replaced
    with c.getHTTPClient() + client.Do.
  - crypto/tls import removed (no longer used at this layer).

Tests:
  - TestEJBCA_MTLSKeypairRotation_PicksUpNewCertWithoutRestart
    (new, ejbca_mtls_rotation_test.go): generates two CAs (caA,
    caB), signs leafA + leafB, spins up an httptest TLS server
    that trusts both CAs and records the issuer DN of every
    presented client cert, writes leafA, makes request 1, writes
    leafB + advances mtime by 2s, makes request 2. Asserts the
    server saw caA's DN on req 1 and caB's DN on req 2 — the
    cache picked up the rotation without ejbca.New re-running.
  - export_test.go: GetHTTPClientForTest helper exposes the
    private getHTTPClient so the rotation test drives the
    production code path.
  - All existing EJBCA tests still pass (TestNew_MTLSWiresClientCert,
    TestNew_MTLSCertLoadFailure, TestNew_OAuth2NoTransportTuning,
    TestNew_InvalidAuthMode).

Verified locally:
  - gofmt clean across the repo.
  - go vet ./... clean across the repo.
  - go test -race -count=1 -short ./internal/connector/issuer/ejbca/...
    ./internal/connector/issuer/mtlscache/... green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/
RESULTS.md Top-10 fix #1.
2026-05-03 20:38:19 +00:00
shankar0123 bee47f0318 acme-server: cert-manager integration test + production hardening (Phase 5/7)
Closes the production-readiness loop on the ACME surface. After this
commit, certctl ships per-account rate limits + a GC sweeper for
expired ACME state + a kind-driven cert-manager 1.15 integration test
+ a lego-driven RFC conformance harness + a k6 loadtest scenario for
the unauthenticated ACME path.

Architecture:
  - Rate limits live in-memory + per-replica. Restart wipes the
    counters; orders/hour caps are eventual-consistency anyway. A
    3-replica certctl-server fleet behind an LB effectively has 3x
    the configured throughput per account; persistent rate limiting
    is a follow-up if production telemetry shows abuse patterns we
    can't catch in a single restart cycle. Per-key + per-action
    isolation: ActionNewOrder/acc-1, ActionKeyChange/acc-1, and
    ActionChallengeRespond/<challenge-id> are independent buckets.
  - GC loop follows the existing scheduler-loop pattern (atomic.Bool
    + sync.WaitGroup; see crlGenerationLoop for shape). Three
    independent SQL sweeps per tick (DELETE expired nonces; UPDATE
    pending authzs whose expires_at < now() to expired; UPDATE
    pending/ready/processing orders whose expires_at < now() to
    invalid). Each sweep is a single statement; failures are logged-
    and-continued so a failing nonces sweep doesn't block authzs.
    Per-sweep 1m timeout bounds a stuck Postgres.
  - cert-manager integration test is gated on KIND_AVAILABLE so CI
    skips it cleanly (kind is too heavy for per-PR). Operators run
    locally via 'make acme-cert-manager-test'; the harness brings up
    a fresh cluster each run + tears it down on Cleanup.
  - lego conformance harness drives a real ACME client through
    register → run → cert-PEM-landed against a hermetic certctl
    stack. Catches RFC-shape regressions third-party clients would
    hit before they ship.
  - k6 ACME-flow scenario hammers the unauthenticated surface
    (directory + new-nonce + ARI synthetic-id) at 100 VUs × 5m. JWS-
    signed flows are out of scope for k6 (no JWS support); they're
    covered by the lego harness above.

What ships:
  - internal/api/acme/ratelimit.go (+ ratelimit_test.go: 7 cases —
    disable-when-perHour-zero, capacity, per-key isolation, per-
    action isolation, refill-over-time, RetryAfter, concurrent-access
    with -race + 200 goroutines × 200 calls).
  - internal/repository/postgres/acme.go: 4 new methods —
    CountActiveOrdersByAccount + GCExpiredNonces + GCExpireAuthorizations
    + GCInvalidateExpiredOrders. Each a single SQL statement.
  - internal/service/acme.go: SetRateLimiter + GarbageCollect +
    rate-limit gates at 3 entry points (CreateOrder + RotateAccountKey
    + RespondToChallenge) + concurrent-orders gate at CreateOrder.
    2 new sentinels (ErrACMERateLimited, ErrACMEConcurrentOrdersExceeded);
    5 new GC metrics (gc_runs / gc_run_failures / gc_nonces_reaped /
    gc_authzs_expired / gc_orders_invalidated).
  - internal/scheduler/scheduler.go: ACMEGarbageCollector interface +
    acmeGCRunning atomic.Bool + acmeGCInterval + 2 setters (SetACME-
    GarbageCollector + SetACMEGCInterval) + acmeGCLoop following the
    crlGenerationLoop shape.
  - internal/api/handler/acme.go: writeServiceError gains rateLimited
    (429 + RFC 8555 §6.7) + concurrent-orders-exceeded mappings.
  - internal/config/config.go: 5 new env vars
    (CERTCTL_ACME_SERVER_RATE_LIMIT_ORDERS_PER_HOUR=100,
    CERTCTL_ACME_SERVER_RATE_LIMIT_CONCURRENT_ORDERS=5,
    CERTCTL_ACME_SERVER_RATE_LIMIT_KEY_CHANGE_PER_HOUR=5,
    CERTCTL_ACME_SERVER_RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR=60,
    CERTCTL_ACME_SERVER_GC_INTERVAL=1m).
  - cmd/server/main.go: NewRateLimiter() + SetRateLimiter() at
    startup; conditional SetACMEGarbageCollector(acmeService) +
    SetACMEGCInterval(cfg.ACMEServer.GCInterval) when Enabled+
    GCInterval > 0.
  - deploy/test/acme-integration/: kind-config.yaml + cert-manager-
    install.sh + clusterissuer-trust-authenticated.yaml +
    clusterissuer-challenge.yaml + certificate-test.yaml + conformance-
    lego.sh + certmanager_test.go (//go:build integration + KIND_AVAILABLE
    gate).
  - deploy/test/loadtest/k6/acme_flow.js + README ACME-flows section.
  - Makefile: 2 new PHONY targets (acme-cert-manager-test +
    acme-rfc-conformance-test).
  - docs/acme-server.md: status flipped to Phase 5; Configuration
    table grows 5 rows; new 'Phase 5 — operational guidance' section
    explaining rate-limit math + GC sweeper semantics + cert-manager
    integration + lego conformance + k6 baseline.

Tests:
  - 'go vet ./...' clean across the repo.
  - 'go test -short -count=1 ./internal/...' green across every
    affected package (service / acme / handler / scheduler / repo /
    config).
  - 'go vet -tags=integration ./deploy/test/acme-integration/' clean
    (the integration test compiles cleanly with the build tag).
  - The kind/cert-manager harness is gated behind KIND_AVAILABLE so
    CI skips by default; operators run locally via 'make acme-cert-
    manager-test'.

Engineering history: cowork/WORKSPACE-CHANGELOG.md 'ACME-Server-5'.
2026-05-03 19:42:03 +00:00
shankar0123 650f5a198f fix: collapse identical if/else branches in Account handler (CodeQL #25)
CodeQL alert #25 (go/duplicate-branches) on internal/api/handler/
acme.go::ACMEHandler.Account flagged that 'if readOnly { ... } else
{ ... }' had byte-identical bodies — both setting the same
Content-Type: application/json header. The 'readOnly' bool was
threaded through the function as a placeholder for differentiated
headers (Cache-Control etc. on the POST-as-GET path) that never
landed; both branches collapsed to the same value with no
follow-through.

Audit + fix:
  - The alert is real (verified by re-reading the source); not a
    false positive.
  - The Copilot Autofix Anthropic surfaced was correct in spirit but
    incomplete: it collapsed the if/else but left 'readOnly' as
    dead code (declared at line 395, assigned at lines 400 and 436,
    only read at the now-removed if). golangci-lint's 'unused'
    linter would flag 'readOnly' next.
  - Complete fix: collapse the if/else AND remove the now-unused
    'readOnly' variable + its 2 assignments. Single unconditional
    'w.Header().Set("Content-Type", "application/json")' covers
    both paths (RFC 8555 §6.3 POST-as-GET + §7.3.2 / §7.3.6 update
    + deactivation all return the same account JSON shape — no spec
    rationale for differentiating headers).

Verified locally: 'gofmt -l .' clean; 'go vet ./...' clean;
'go test -short -count=1 ./internal/api/handler/' green; 'grep
readOnly' on the file returns only the new explanatory comment
(no live references).

The alert was first detected in commit 44a85d6 (Phase 1b) — the
duplicate has been sitting in the codebase since the Account
handler shipped. No functional regression for any RFC 8555 client
(cert-manager, lego, Posh-ACME): same status code, same headers,
same body.
2026-05-03 19:07:21 +00:00
shankar0123 1e1bc9b3b4 ci: fix Phase 4 post-push unused-symbol failures
CI on commit f6ba563 (Phase 4 gofmt fix) failed golangci-lint's
'unused' linter on internal/service/acme_phase4_test.go: the
stubRenewalPolicies type + its Get method were defined for a future
RenewalInfo happy-path test that I never actually wrote — only the
disabled + bad-cert-id negatives. The dead-code carried forward
because go vet doesn't catch unused-but-exported-shape, and the
package-private use never materialized.

Fix: delete the stubRenewalPolicies type + its method + the
adjacent stub-comment that referenced a similarly-imagined
stubIssuerConn that was never written either. The tests I have
(RotateAccountKey happy + duplicate, RevokeCert kid + jwk paths +
already-revoked + reason-clamping, RenewalInfo disabled +
bad-cert-id) all still pass — they don't reference the removed
type. The window-math is exercised directly in
internal/api/acme/phase4_test.go::TestComputeRenewalWindow_*; the
service-layer policy-lookup wiring is read at handler smoke time
in Phase 5.

Confirmed: 'gofmt -l .' clean; 'go vet ./internal/service/' clean;
'go test -short -count=1 ./internal/service/' green. Pre-commit
verification gate updated implicitly: future Phase commits should
spot-check unused-shape via grep against the test file (every
stub* helper should have ≥3 references, matching the live
helpers' usage profile).
2026-05-03 19:02:44 +00:00
shankar0123 f6ba5634fd ci: fix Phase 4 post-push gofmt failure (map-literal alignment)
CI on commit 4dc8d3f (Phase 4) failed gofmt on
internal/api/router/openapi_parity_test.go. The 6 new SpecParity-
Exceptions entries I added for the Phase 4 routes had over-padded
whitespace between key and value; the longest new key is
'"GET /acme/profile/{id}/renewal-info/{cert_id}":' which sets the
gofmt-canonical column width for the surrounding block, but my
hand-aligned values used the wider Phase-2 column width (set by the
even-longer 'POST /acme/profile/{id}/order/{ord_id}/finalize' key in
that block).

gofmt aligns map-literal columns per contiguous run between blank
lines / structural breaks, not file-globally. The Phase 4 entries
form their own run because they're separated from the Phase 2 block
by the '// Phase 4 — key rollover + revocation + ARI.' comment.

Fix: 'gofmt -w' on the file, which rewrote the 6 lines with the
correct (narrower) intra-block alignment. No semantic change — just
whitespace.

Confirmed: 'gofmt -l .' clean; 'go vet ./internal/api/router/' clean
(the test still passes after the formatting change).
2026-05-03 18:58:00 +00:00
shankar0123 4dc8d3fa5b acme-server: key rollover + revocation + ARI (Phase 4/7)
Closes the RFC 8555 + RFC 9773 surface beyond the issuance happy-path:
  - POST /acme/profile/<id>/key-change   (RFC 8555 §7.3.5)
  - POST /acme/profile/<id>/revoke-cert  (RFC 8555 §7.6)
  - GET  /acme/profile/<id>/renewal-info/<cert-id>  (RFC 9773 ARI)

After this commit, ACME clients can rotate account keys, revoke certs
through the ACME surface (rather than only via the certctl GUI/API),
and fetch ARI for proactive renewal scheduling.

Architecture:
  - Key rollover: outer JWS verified against the registered account key
    (existing kid path); the inner JWS — embedded as the outer's payload
    — verified against the embedded NEW jwk in a new dedicated routine
    (ParseAndVerifyKeyChangeInner) that enforces RFC 8555 §7.3.5
    inner-only invariants: MUST use jwk + MUST NOT use kid, payload
    .account == outer.kid, payload.oldKey thumbprint-equals registered.
    A single WithinTx swaps the stored thumbprint+pem and writes the
    audit row. Concurrent-rollover safety via SELECT…FOR UPDATE on the
    conflicting account row in UpdateAccountJWKWithTx; the loser
    observes the winner's new thumbprint and is told to retry (409).
  - Revocation: two auth paths. kid → AccountOwnsCertificate single-
    indexed COUNT lookup over acme_orders. jwk → constant-time RFC 7638
    thumbprint compare against the cert's pubkey. Both paths route
    through service.RevocationSvc.RevokeCertificateWithActor so the
    existing CRL/OCSP refresh + audit + metrics pipeline applies. RFC
    5280 §5.3.1 numeric reason codes clamp to certctl's
    domain.ValidRevocationReasons; codes 8 (removeFromCRL) + 10
    (aACompromise) clamp to 'unspecified' since they aren't in the set.
  - ARI is GET-only and unauth per RFC 9773 §4. Cert-id wire shape is
    base64url(AKI).base64url(serial); ParseARICertID strict-decodes,
    SerialHex emits the canonical certctl-shape lowercase-no-leading-
    zeros hex used in certificate_versions.serial_number.
    ComputeRenewalWindow has 3 branches: bound RenewalPolicy →
    [notAfter - days, notAfter - days/2]; no policy → last 33% of
    validity; past expiry → [now, now + 1d] (renew immediately).
    Retry-After honors CERTCTL_ACME_SERVER_ARI_POLL_INTERVAL.

What ships:
  - internal/api/acme/{keychange,ari}.go (+ phase4_test.go: 15 tests).
  - internal/api/acme/order.go: RevokeCertRequest wire shape.
  - internal/api/handler/acme.go: KeyChange, RevokeCert, RenewalInfo
    + 11 new writeServiceError mappings.
  - internal/repository/postgres/acme.go: UpdateAccountJWKWithTx (FOR
    UPDATE + expectedOldThumbprint precondition; ErrACMEAccountKey-
    ConcurrentUpdate sentinel) + AccountOwnsCertificate.
  - internal/service/acme.go: RotateAccountKey + RevokeCert +
    RenewalInfo; CertificateRevoker + RenewalPolicyLookup interfaces;
    SetRevocationDelegate + SetRenewalPolicyLookup wiring; 11 new
    sentinels; 6 new metrics.
  - internal/service/acme_phase4_test.go: service-layer tests for
    RotateAccountKey (happy + duplicate-key) + RevokeCert (kid mismatch
    + jwk mismatch + jwk happy + already-revoked + reason-clamping) +
    RenewalInfo (disabled + bad cert-id).
  - internal/api/router/router.go: 6 new register calls (3 per-profile
    + 3 shorthand). Router parity exceptions extended in lockstep
    (in-tree SpecParityExceptions + CI-only openapi-handler-exceptions
    .yaml).
  - cmd/server/main.go: SetRevocationDelegate(revocationSvc) +
    SetRenewalPolicyLookup(renewalPolicyRepo) at startup.
  - internal/config/config.go: CERTCTL_ACME_SERVER_ARI_ENABLED (default
    true) + CERTCTL_ACME_SERVER_ARI_POLL_INTERVAL (default 6h);
    BuildDirectory's ariEnabled flag now flips on under
    cfg.ARIEnabled.
  - docs/acme-server.md: phase status flipped to Phase 4; endpoints
    table grows 6 rows (3 per-profile + 3 shorthand); FAQ section
    appended explaining how to rotate keys, revoke certs, and consume
    ARI.

Tests:
  - 'go vet ./...' clean across the repo.
  - 'go test -short -count=1 ./...' green across every package.
  - phase4_test.go covers: keychange happy-path + 5 negatives +
    MapKeyChangeErrorToProblem coverage; ARI cert-id round-trip + 6
    malformed cases + BuildARICertID from a generated cert; window-
    math 3 branches.
  - service-layer tests confirm: RotateAccountKey atomically swaps the
    thumbprint (verifies persisted state) and rejects duplicate keys;
    RevokeCert routes through the stub RevocationSvc with the right
    actor string + reason on the jwk path, rejects mismatched keys,
    rejects already-revoked certs, clamps reason codes correctly;
    RenewalInfo respects ARIEnabled + cert-id format.

Engineering history: cowork/WORKSPACE-CHANGELOG.md 'ACME-Server-4'.
2026-05-03 16:51:06 +00:00
shankar0123 62513ad12f ci: fix Phase 3 post-push CI failures (contextcheck + ST1021)
CI on commit 9bc8453 (Phase 3 challenges) failed three lint checks under
golangci-lint. Two were contextcheck on internal/service/acme.go
RespondToChallenge, where the validator-pool dispatch deliberately
detached from the request ctx via 'context.Background()' so the async
WithinTx survives the HTTP handler returning. contextcheck rightly
flagged the non-inherited context — the canonical Go 1.21+ answer for
this exact pattern is context.WithoutCancel(ctx), which preserves
inherited values (logger, trace IDs, audit actor) but detaches
cancellation. Swapping that in clears both contextcheck hits.

The third was ST1021 on internal/api/acme/validators.go: a comment
intended for the (*Pool).Snapshot() method had landed above the
PoolSnapshot type by accident. Split the comment — one prose line
for the type, one for the method — so each exported symbol carries
its own properly-anchored doc.

Confirmed local 'go vet' clean and 'go test -short -count=1' green
across internal/service/ and internal/api/acme/ before commit.
2026-05-03 15:56:03 +00:00
shankar0123 9bc845304e acme-server: HTTP-01 + DNS-01 + TLS-ALPN-01 challenge validation (Phase 3/7)
Wires up the actual challenge-validation machinery so profiles in
acme_auth_mode='challenge' resolve end-to-end. After this commit,
cert-manager 1.15+ with `solver: http01: ingress` against a
challenge-mode profile completes a real HTTP-01 flow and gets a cert.
DNS-01 + TLS-ALPN-01 share the same code path with the appropriate
validator selection.

Architecture (the load-bearing parts):
  - 3 separate semaphore-bounded worker pools (one per challenge type),
    so HTTP-01 and DNS-01 can't starve each other under load. Default
    weight 10 per type; tunable via CERTCTL_ACME_SERVER_HTTP01_CONCURRENCY,
    DNS01_CONCURRENCY, TLSALPN01_CONCURRENCY.
  - 30s per-challenge timeout (configurable via PoolConfig.PerChallengeTimeout).
  - HTTP-01 validator runs validation.IsReservedIPForDial (newly
    exported wrapper preserving the existing private impl byte-for-byte
    for the network scanner + ValidateSafeURL paths) on the resolved
    IP — both at the initial dial and every redirect hop. SSRF probes
    into private IP space are refused before the connect.
  - DNS-01 validator uses a dedicated resolver pointed at
    CERTCTL_ACME_SERVER_DNS01_RESOLVER (default 8.8.8.8:53) — does
    NOT use the system resolver to keep behavior deterministic across
    deployments. Wildcard handling: `*.example.com` queries
    _acme-challenge.example.com.
  - TLS-ALPN-01 validator (RFC 8737) connects with ALPN `acme-tls/1`,
    inspects the id-pe-acmeIdentifier extension (OID 1.3.6.1.5.5.7.1.31),
    asserts the ASN.1 OCTET STRING value equals SHA-256 of the key
    authorization. Cert chain is intentionally NOT validated
    (InsecureSkipVerify=true is correct per RFC 8737 — the proof is
    in the extension, not the chain). Documented in docs/tls.md L-001
    table + the //nolint:gosec comment carries the justification.
    SSRF guard: same posture as HTTP-01.
  - Validation is asynchronous: handler accepts the POST and returns
    200 immediately with status=processing; the worker-pool fires a
    callback that updates challenge → authz → order in a fresh
    background-context WithinTx. The order auto-promotes to `ready`
    when ALL authzs become valid; auto-fails to `invalid` when ANY
    authz becomes invalid.

What ships:
  - internal/api/acme/challenge.go: KeyAuthorization (RFC 8555 §8.1) +
    DNS01TXTRecordValue (§8.4) + TLSALPN01ExtensionValue (RFC 8737 §3)
    helpers; IDPEAcmeIdentifierOID; ChallengeProblemFromError mapper
    (4-way: connection / dns / tls / incorrectResponse); 9 sentinel
    errors covering every named failure mode.
  - internal/api/acme/validators.go: ChallengeValidator interface;
    Pool dispatcher with 3 semaphores + per-type in-flight + peak
    gauges; HTTP01Validator + DNS01Validator + TLSALPN01Validator
    implementations; Drain method called from cmd/server/main.go's
    shutdown sequence.
  - internal/api/acme/validators_test.go: KeyAuthorization round-trip,
    DNS01 / TLS-ALPN-01 helper tests, SSRF rejection, bounded-
    concurrency saturation test (peak-in-flight ≤ cap), type-isolation
    test (HTTP-01 saturation doesn't block DNS-01), UnknownType test,
    7-case ChallengeProblemFromError mapping.
  - internal/repository/postgres/acme.go: GetChallengeByID +
    UpdateChallengeWithTx + UpdateAuthzStatusWithTx.
  - internal/service/acme.go: SetValidatorPool wires the *acme.Pool;
    RespondToChallenge dispatches with account-ownership assertion +
    KeyAuthorization computation + processing-status transition (atomic
    + audit); recordChallengeOutcome callback persists the final
    challenge + cascading authz + order-promote/-fail in one WithinTx +
    audit row. 4 new metrics.
  - internal/api/handler/acme.go: Challenge handler; round-trips
    account.JWKPEM through ParseJWKFromPEM to recover the *jose.JSONWebKey
    the validator pool needs.
  - internal/api/router/router.go + openapi_parity_test.go +
    api/openapi-handler-exceptions.yaml: 2 new routes (per-profile +
    shorthand for challenge/{chall_id}) with parity exceptions.
  - cmd/server/main.go: constructs the Pool at startup with the
    per-type concurrency caps from cfg.ACMEServer; ACMEService.ValidatorPool()
    accessor exposed for the shutdown drain sequence.
  - internal/validation/ssrf.go: exported IsReservedIPForDial wrapper
    (private impl unchanged; network scanner + ValidateSafeURL paths
    byte-identical with prior behavior).
  - docs/tls.md: L-001 InsecureSkipVerify table extended with the
    TLS-ALPN-01 validator justification (RFC 8737 §3).
  - docs/acme-server.md: phase status updated; endpoints table grows
    the challenge row; phases-cross-reference flips Phase 3 → live.

Tests:
  - 80%+ coverage on the new files.
  - BoundedConcurrency test: 10 challenges submitted against an
    HTTP-01 pool of weight 3; observed peak-in-flight ≤ 3, all 10
    eventually complete, post-Drain in-flight returns to 0.
  - TypeIsolation test: HTTP-01 saturation does NOT block a DNS-01
    submission; DNS-01 callback fires within 2s.
  - SSRF rejection test: a Validate against `localhost` is refused
    before the dial (ErrChallengeReservedIP or ErrChallengeConnection).

Engineering history: cowork/WORKSPACE-CHANGELOG.md "ACME-Server-3".
2026-05-03 14:09:00 +00:00
shankar0123 c351bba41a acme-server: orders + authorizations + finalize + cert download (Phase 2/7)
Closes the issuance loop in trust_authenticated mode (commits ec88a61
+ 44a85d6 wired the foundation + JWS-verified account resource).
After this commit, an ACME client running against a profile with
acme_auth_mode='trust_authenticated' end-to-end-issues a real cert:

  POST /acme/profile/<id>/new-order      → 201 + order URL (status=ready)
  POST /acme/profile/<id>/order/<oid>    → POST-as-GET fetch
  POST /acme/profile/<id>/order/<oid>/finalize  → 200 + status=valid + cert URL
  POST /acme/profile/<id>/cert/<cid>     → 200 + PEM chain

Profiles with acme_auth_mode='challenge' get the same code path with
authz/challenge rows in `pending` state until Phase 3's validators
wire up. The mode is read from the bound profile's column at request
time, NOT cached at server start — operators flipping the column via
SQL take effect on the next order without restart.

Architecture (the load-bearing part):
  - Finalize routes through service.CertificateService.Create — the
    canonical certctl issuance entry point that wraps the
    managed_certificates row insert + audit row in s.tx.WithinTx.
    RenewalPolicy / CertificateProfile / per-issuer-type Prometheus
    metrics / audit rows all apply uniformly to ACME-issued certs via
    the same code path that already serves EST/SCEP/agent/REST issuance.
  - Identifier validation runs BEFORE order creation. Rejected
    identifiers return RFC 7807 with per-identifier subproblems and
    create no order row.
  - Source stamp on managed_certificates: domain.CertificateSourceACME.
    Operators bulk-revoke ACME-issued certs by filtering on Source=ACME.
  - 3-step atomicity boundary documented in code + this commit msg:
    (A) WithinTx-A marks order processing + audit row.
    (B) IssuerConnector.IssueCertificate + CertificateService.Create
        (each in its own WithinTx — Create wraps cert row + audit
        atomically).
    (C) WithinTx-C creates certificate_versions row + transitions order
        to valid + sets certificate_id + audit row.
    The brief window between B and C can leave a managed_certificates
    row whose order is still in `processing`. Phase 5's GC scheduler
    reconciles. Documented inline.

What ships:
  - internal/api/acme/order.go: OrderResponseJSON + AuthorizationResponseJSON
    + ChallengeResponseJSON + NewOrderRequest + FinalizeRequest wire
    shapes; ValidateIdentifiers (Phase 2 syntactic checks, dns-only);
    CSRMatchesIdentifiers (RFC 8555 §7.4 strict equality, case-folded).
  - internal/domain/acme.go: ACMEOrder + ACMEAuthorization + ACMEChallenge
    + ACMEIdentifier + ACMEProblem domain types + closed status enums
    for each (order: pending|ready|processing|valid|invalid; authz:
    pending|valid|invalid|deactivated|expired|revoked; challenge:
    pending|processing|valid|invalid; challenge type: http-01|dns-01|
    tls-alpn-01).
  - internal/domain/profile.go: new ACMEAuthMode field reading from
    certificate_profiles.acme_auth_mode (added in migration 25).
  - internal/domain/certificate.go: new CertificateSourceACME enum value.
  - internal/repository/postgres/profile.go: extended SELECT/scanProfile
    to read the per-profile acme_auth_mode column with a COALESCE
    default of trust_authenticated.
  - internal/repository/postgres/acme.go: full order/authz/challenge
    CRUD (CreateOrderWithTx + GetOrderByID + UpdateOrderWithTx +
    CreateAuthzWithTx + GetAuthzByID + ListAuthzsByOrder +
    ListChallengesByAuthz + CreateChallengeWithTx) with proper
    sql.NullTime + JSONB handling. scanACMEOrder /
    scanACMEAuthz / scanACMEChallenge helpers.
  - internal/service/acme.go: extended ACMERepo interface; new
    SetIssuancePipeline wires certificateService + certificateRepo +
    issuerRegistry. CreateOrder (auth-mode-dispatched: trust_authenticated
    auto-marks order ready + authz valid + 1 placeholder http-01
    challenge valid; challenge mode keeps everything pending). LookupOrder
    (with account-ownership assertion). LookupAuthz. ListAuthzsByOrder.
    FinalizeOrder (3-step atomicity boundary as above; CSR-vs-order
    SAN strict-equality check before issuance; persists FinalizeOrderResult
    {Order, CertID}). LookupCertificate. randIDSuffix + base32encode
    helpers for the human-readable acme-ord-* / acme-authz-* /
    acme-chall-* prefixes (CLAUDE.md "TEXT primary keys with human-
    readable prefixes" architecture decision). 8 new per-op metrics.
  - internal/service/acme_test.go: extended fakeACMERepo with Phase 2
    interface stubs; new orderTrackingRepo for observable persistence;
    2 new tests asserting trust_authenticated → auto-ready/valid and
    challenge → stays-pending.
  - internal/api/handler/acme.go: NewOrder + Order + OrderFinalize +
    Authz + Cert handler methods. orderURL / authzURL / certURL /
    challengeURLBuilder helpers; marshalOrderForResponse fetches
    per-order authzs to populate the URL list. parseOptionalTime for
    notBefore / notAfter.
  - internal/api/handler/acme_handler_test.go: extended mockACMEService
    with Phase 2 method stubs; 4 new handler tests (NewOrder happy +
    rejected-identifier + OrderFinalize bad-CSR + Cert happy).
  - internal/api/router/router.go: 10 new Register calls (5 per-profile
    + 5 shorthand) for new-order, order/{ord_id}, order/{ord_id}/finalize,
    authz/{authz_id}, cert/{cert_id}.
  - internal/api/router/openapi_parity_test.go + api/openapi-handler-exceptions.yaml:
    10 new exception entries.
  - cmd/server/main.go: SetIssuancePipeline at startup, threading
    certificateService + certificateRepo + issuerRegistry into ACMEService.
  - docs/acme-server.md: phase status updated; endpoints table grows
    5 rows for new-order/order/finalize/authz/cert (per-profile +
    shorthand variants); new section "Finalize routing through
    CertificateService.Create" documenting the 3-step atomicity
    boundary + the actor-string convention `acme:<account-id>`.

Tests: ACME package + service + handler + router + config + domain
all green under -short. New cases:
  - TestCreateOrder_TrustAuthenticated_AutoReady (asserts auto-ready
    transition + valid-status authz/challenge + audit row + metric bump).
  - TestCreateOrder_ChallengeMode_StaysPending (asserts pending-status
    cascading authz/challenge for challenge mode).
  - TestACMEHandler_NewOrder_HappyPath (asserts 201 + Location +
    finalize URL shape).
  - TestACMEHandler_NewOrder_RejectedIdentifier (asserts 400 + RFC 7807
    rejectedIdentifier + per-identifier subproblems for type=ip).
  - TestACMEHandler_OrderFinalize_BadCSR (asserts 400 + badCSR for
    non-base64 CSR field).
  - TestACMEHandler_Cert_HappyPath (asserts 200 + PEM content-type +
    PEM chain in body).

Engineering history: cowork/WORKSPACE-CHANGELOG.md "ACME-Server-2".
2026-05-03 13:46:10 +00:00
shankar0123 44a85d6f85 acme-server: account resource + JWS verifier (Phase 1b/7)
Layers JWS-authenticated POST machinery onto the Phase 1a foundation
(commit ec88a61). After this commit, an ACME client can run

  POST /acme/profile/<id>/new-account

against certctl and successfully register an account. Account update
+ deactivation via POST /acme/profile/<id>/account/<acc-id> work.
Orders + challenges remain Phase 2 / 3.

Background:
  Two prior dispatch attempts at the original Phase 1 ("skeleton +
  directory + new-nonce + new-account" as a single commit) failed on
  go-jose v4 API speculation (jws.GetPayload, sig.Algorithm,
  jose.SHA256, etc. — none of those exist in v4). Splitting Phase 1
  into 1a (foundation, no go-jose) and 1b (this commit, all go-jose
  in one place) concentrated the JWS work where attention pays off.
  The verifier reads the actual go-jose v4 surface — ParseSigned with
  closed alg allow-list, Header struct fields (Algorithm, KeyID,
  JSONWebKey, Nonce, ExtraHeaders[HeaderKey]), JWK.Thumbprint with
  stdlib crypto.SHA256.

What ships:
  - internal/api/acme/jws.go: 487-line verifier + sentinel error
    family. Enforces RFC 8555 §6.2 + §6.4 + §6.5 invariants:
      - alg in {RS256, ES256, EdDSA} (closed allow-list passed to
        jose.ParseSigned — HS256 / none / etc. rejected at parse time)
      - exactly one of `kid` / `jwk` in protected header (per
        endpoint policy — new-account demands jwk, others demand kid)
      - protected `url` matches request URL exactly
      - protected `nonce` consumed against acme_nonces (badNonce on
        miss/replay/expiry per RFC 8555 §6.5.1)
      - kid round-trips against canonical AccountKID(accountID) URL
        (catches cross-profile / cross-host replay)
      - kid path: account exists + status=valid (deactivated /
        revoked accounts cannot authenticate)
      - signature verifies; post-Verify payload bytes equal
        UnsafePayloadWithoutVerification (defense in depth)
    + JWK persistence helpers (JWKToPEM / ParseJWKFromPEM round-
    trip a public-only JWK as a PEM-wrapped JSON envelope; stored
    as TEXT in acme_accounts.jwk_pem for diff-friendliness) +
    JWKThumbprint per RFC 7638.
  - internal/api/acme/jws_test.go: 16 cases covering happy paths
    (RS256 kid, ES256 jwk, EdDSA kid) + every named failure mode
    (alg-not-allowed, bad-sig, missing-nonce, unknown-nonce,
    replay, url-mismatch, mixed kid+jwk, deactivated-account,
    cross-host kid). Uses real keypairs + real go-jose Signer to
    build JWS objects.
  - internal/api/acme/account.go: NewAccountRequest /
    AccountUpdateRequest payload shapes (RFC 8555 §7.3 + §7.3.2 +
    §7.3.6) + AccountResponseJSON wire shape + MarshalAccount
    helper.
  - internal/domain/acme.go: ACMEAccount struct + ACMEAccountStatus
    closed enum (valid / deactivated / revoked).
  - internal/repository/postgres/acme.go: full account CRUD path
    (CreateAccountWithTx with 23505-unique-violation sentinel
    translation, GetAccountByID, GetAccountByThumbprint,
    UpdateAccountContactWithTx, UpdateAccountStatusWithTx) +
    sql.ErrNoRows-wrapped repository.ErrNotFound on lookup misses.
  - internal/service/acme.go: ACMERepo interface extended;
    SetTransactor + SetAuditService wires; NewAccount (idempotent
    re-registration per RFC 8555 §7.3.1 — same JWK returns existing
    row without an update or new audit event); LookupAccount;
    UpdateAccount; DeactivateAccount; VerifyJWS adapter that bridges
    api/acme.VerifierConfig to the service-layer ACMERepo; per-op
    metrics extended (new_account_total + _failures_total +
    _idempotent_total + update_account_total + _failures_total +
    deactivate_account_total).
  - internal/service/acme_test.go: 8 new tests covering
    new-account happy path / idempotent re-registration / only-
    return-existing match + no-match / contact update / deactivate
    / lookup-not-found / requires-transactor.
  - internal/api/handler/acme.go: NewAccount + Account handlers.
    Account dispatches POST-as-GET (RFC 8555 §6.3 — empty body or
    {} payload returns the account row), contact update, and
    deactivation from the same endpoint. Defense-in-depth check
    that the kid path-segment matches the URL path-segment (the
    verifier already round-tripped the kid against canonical URL,
    but the handler re-asserts to catch any future verifier
    refactor).
  - internal/api/handler/acme_handler_test.go: 7 new cases
    covering happy-create, idempotent-200, only-return-existing-
    no-match-400, malformed-JWS-400, kid-URL-mismatch-401,
    deactivate, contact-update, POST-as-GET.
  - internal/api/router/router.go: 4 new Register calls (per-
    profile + shorthand for new-account and account/{acc_id}).
  - internal/api/router/openapi_parity_test.go: SpecParityExceptions
    extended with the 4 new routes (RFC 8555 wire-protocol surface,
    not OpenAPI-shaped — same precedent as Phase 1a).
  - cmd/server/main.go: SetTransactor + SetAuditService on
    acmeService at startup so the WithinTx-based new-account /
    update / deactivate paths run with the same transactor instance
    shared across CertificateService / RevocationSvc / RenewalService.
  - docs/acme-server.md: Phase status updated; endpoints table grows
    new-account + account/<acc_id> rows; new "JWS verification
    (Phase 1b)" section enumerates the 7 invariants the verifier
    enforces; phases-cross-reference table marks 1b live.
  - go.mod / go.sum: github.com/go-jose/go-jose/v4 v4.0.4 added.

Atomicity: every account-state mutation writes its acme_accounts row
+ its audit_events row inside one repository.Transactor.WithinTx
call — the canonical certctl atomicity contract (matches
CertificateService.Create at internal/service/certificate.go:131).
Idempotent re-registration explicitly does NOT write an audit row
(RFC 8555 §7.3.1 returns the existing row unmodified).

Tests: 16 jws_test.go cases + 11 service tests + 11 handler tests
all pass under -short. Bad-signature test uses a real registered
account whose stored JWK is a different keypair from the signer's,
so the JWS parses cleanly but jose.Verify rejects — exercises the
ErrJWSSignatureInvalid path directly.

Engineering history: cowork/WORKSPACE-CHANGELOG.md "ACME-Server-1b".
2026-05-03 13:21:56 +00:00
shankar0123 ec88a61274 acme-server: foundation — directory + new-nonce + per-profile routing (Phase 1a/7)
First slice of the RFC 8555 ACME server endpoint (master plan at
cowork/acme-server-endpoint-prompt.md, per-phase prompts at
cowork/acme-server-prompts/). This commit lands the smallest viable
end-to-end deployable slice: an ACME client running

  curl -sk https://certctl/acme/profile/<id>/directory
  curl -sk -I https://certctl/acme/profile/<id>/new-nonce

successfully fetches the directory document and a Replay-Nonce.
Account creation, JWS verification, orders, challenges, and
revocation are all out of scope for this phase and arrive in Phases
1b–4.

Closes the Rank 1 LHF from the 2026-05-03 Infisical deep-research
(cowork/infisical-deep-research-results.md). Pre-fix, certctl was an
ACME consumer only — no /acme/directory endpoint, no JWS verifier,
no challenge validators. K8s customers running cert-manager could
not point at certctl as an ACME issuer; they had to deploy a certctl
agent on every node.

What ships:
  - internal/api/acme/{directory,nonce,errors}.go (+ tests).
  - internal/api/handler/acme.go + acme_handler_test.go.
  - internal/repository/postgres/acme.go (nonce ops only — Phase 1b
    extends with account CRUD; Phases 2-4 extend with order / authz /
    challenge CRUD).
  - internal/service/acme.go (BuildDirectory + IssueNonce stubs;
    Phase 1b adds VerifyJWS / NewAccount / etc.).
  - migrations/000025_acme_server.{up,down}.sql ships the full 5-table
    ACME schema (acme_accounts / acme_orders / acme_authorizations /
    acme_challenges / acme_nonces) PLUS the per-profile
    certificate_profiles.acme_auth_mode column. Phase 1a actively
    uses only acme_nonces; remaining tables are empty until Phases
    1b-4 plug in.
  - internal/config/config.go: ACMEServerConfig struct + ACMEServer
    field on Config. Env vars use CERTCTL_ACME_SERVER_* prefix to
    avoid colliding with the existing consumer-side ACMEConfig at
    config.go:1746 (CERTCTL_ACME_DIRECTORY_URL / PROFILE /
    CHALLENGE_TYPE etc.). Phase 1a wires Enabled +
    DefaultAuthMode + DefaultProfileID + NonceTTL + DirectoryMeta;
    Order/Authz TTLs + per-challenge-type concurrency caps + DNS01
    resolver are reserved fields parsed in 1a so operators can set
    them ahead of Phases 2/3.
  - cmd/server/main.go: wire ACMEHandler into the HandlerRegistry
    literal alongside the existing certificate / EST / SCEP / etc.
    handlers.
  - internal/api/router/router.go: HandlerRegistry.ACME field + 6
    Register calls (3 per-profile + 3 shorthand).
  - internal/api/router/openapi_parity_test.go: 6 new entries in
    SpecParityExceptions. ACME is a wire-protocol surface (JWS-signed
    JSON over HTTPS per RFC 7515) whose semantics are dictated by
    RFC 8555 + RFC 9773 rather than by an OpenAPI document, same
    precedent as SCEP/EST. The canonical reference is
    docs/acme-server.md.
  - docs/acme-server.md: Phase-1a-shaped reference. Configuration
    table for every CERTCTL_ACME_SERVER_* env var. Per-profile
    auth-mode decision tree skeleton. TLS trust bootstrap section
    flagging cert-manager's ClusterIssuer.spec.acme.caBundle
    requirement (the single biggest first-time-deploy footgun;
    the full cert-manager walkthrough lands in Phase 6 but the
    requirement is documented up front).

Architecture decisions baked in:
  - URL family is /acme/profile/<id>/* (per-profile, canonical) with
    /acme/* shorthand active when CERTCTL_ACME_SERVER_DEFAULT_PROFILE_ID
    is set. Path matches existing per-profile precedent in EST + SCEP.
  - Auth mode is per-profile (acme_auth_mode column on
    certificate_profiles), NOT server-wide. One certctl-server can
    serve trust_authenticated for an internal-PKI profile and
    challenge for a public-trust-style profile simultaneously. The
    column is read at request time, not cached at server start —
    operators flipping a profile's mode via SQL take effect on the
    next order without restart.
  - Nonces are DB-backed (acme_nonces table). Survive server restart.
    The RFC 8555 §6.5 replay defense requires the store to outlast
    the client's nonce caching window; an in-memory-only nonce
    store would lose every in-flight order on restart.
  - Per-op atomic counters on service.ACMEService.Metrics() —
    certctl_acme_directory_total, certctl_acme_directory_failures_total,
    certctl_acme_new_nonce_total, certctl_acme_new_nonce_failures_total.
    Naming follows certctl frozen decision 0.10 cardinality discipline.
    Phase 1b will extend with new_account counters; Phase 2 with
    order / finalize / cert; Phase 3 with per-challenge-type counters.

Audit fixes #11 + #12 (cowork/acme-server-prompts/audit-additions.md)
applied:
  - #11: CERTCTL_ACME_SERVER_* prefix avoids the consumer-side
    CERTCTL_ACME_* namespace collision.
  - #12: prior-attempt WIP from two failed Phase-1 dispatches was
    discarded at phase start; this commit starts from a clean tree.

Tests:
  - 14 unit tests in internal/api/acme/ (directory, nonce, errors).
  - 7 handler-level tests via httptest.NewServer + mockACMEService
    (mirrors the mockSCEPService pattern at scep_handler_test.go).
  - 7 service-layer tests with mocked repo + injected profileLookup.
  - All pass under -race -count=1 -short.

Deferred to Phase 1b:
  - JWS verification (go-jose v4 — see master-prompt §8a for the API
    surface and audit doc for the speculation pitfalls).
  - new-account / account/<id> endpoints + AccountService.
  - Nonce *consumption* path (issue path is in this commit; consume
    is only invoked by JWS-verified POSTs which Phase 1b adds).

Engineering history: cowork/WORKSPACE-CHANGELOG.md "ACME-Server-1a".
Per-phase implementation plan: cowork/acme-server-prompts/.
Master plan + audit fixes: cowork/acme-server-endpoint-prompt.md +
cowork/acme-server-prompt-audit.md +
cowork/acme-server-prompts/audit-additions.md.
2026-05-03 12:55:40 +00:00
shankar0123 b8b7e1e3dd tlsprobe: add VerifyWithExponentialBackoff + rewire all connectors' runPostDeployVerify
Closes Top-10 fix #8 of the 2026-05-02 deployment-target audit
re-run (see cowork/deployment-target-audit-2026-05-02-rerun/
RESULTS.md). Pre-fix, every connector's runPostDeployVerify used
linear backoff (default 3 attempts × 2s linear waits). Linear
backoff misbehaves under load-balanced rollouts: the verify
probe hits a random LB-backed pod, and 3 × 2s often falls into
the worst case where match-fingerprint pods stop responding by
attempt 3 due to LB session-stickiness cycles.

This commit:

1. New shared helper internal/tlsprobe/retry.go::
   VerifyWithExponentialBackoff. Default 3 attempts; 1s initial,
   16s cap. Doubling pattern: 1s → 2s → 4s → 8s → 16s. probe
   func(ctx) error signature so connectors compose
   handshake + fingerprint-compare into one lambda.

2. Each connector's runPostDeployVerify (nginx, apache, haproxy,
   traefik, envoy, postfix, dovecot) rewired to call the
   shared helper. Per-connector signature unchanged.

3. New PostDeployVerifyMaxBackoff time.Duration field added to
   each connector's Config. Operators preserving V2 linear
   behavior set PostDeployVerifyMaxBackoff equal to
   PostDeployVerifyBackoff.

4. Tests:
   - tlsprobe/retry_test.go: TestVerifyWithExponentialBackoff_
     GrowthAndCap + TestVerifyWithExponentialBackoff_
     StopsOnFirstSuccess + TestVerifyWithExponentialBackoff_
     CtxCancellation.
   - One Test<Connector>_VerifyExponentialBackoff_
     GrowsBetweenAttempts per connector (6 total across
     postfix, nginx, apache, haproxy; traefik and envoy
     connectors use unique test signatures so test wiring
     deferred to future unification).

5. docs/deployment-atomicity.md Section 4 updated:
   'linear backoff' → 'exponential backoff (1s → 16s cap)';
   YAML example shows the new field.

Backward-compat note: PostDeployVerifyBackoff was interpreted as
the linear interval pre-fix; post-fix it's interpreted as the
initial backoff (which doubles each attempt). Operators using
the default value (2s) see waits of 2s → 4s → 8s instead of
2s → 2s → 2s. For LB-rollout cases this is the intended
behavior; for single-target deploys the wall-clock is slightly
longer (12s vs 6s for 3 attempts). Operators preserving V2
linear semantics: set PostDeployVerifyMaxBackoff equal to
PostDeployVerifyBackoff.

Verified locally:
- gofmt clean.
- go test -short -count=1 ./internal/tlsprobe/...
  ./internal/connector/target/{postfix,nginx,apache,haproxy}/... green.

Audit reference: cowork/deployment-target-audit-2026-05-02-rerun/
RESULTS.md Top-10 fix #8.
2026-05-02 22:56:07 +00:00
shankar0123 b16e5b5e97 docs(ssh): operator playbook for InsecureIgnoreHostKey design choice
Closes Top-10 fix #7 of the 2026-05-02 deployment-target audit
re-run (see cowork/deployment-target-audit-2026-05-02-rerun/
RESULTS.md). Pre-fix, the SSH connector's
ssh.InsecureIgnoreHostKey() at internal/connector/target/ssh/
ssh.go (realSSHClient.Connect) had only an inline comment
justifying the design choice. An acquirer's diligence engineer
reading the connector cold pattern-matches "MITM hazard" without
seeing the comment.

This commit lands a doc-side operator playbook in
docs/connectors.md SSH section covering:

1. Why the connector accepts any host key (operator-configured
   target infrastructure; mirrors network scanner's
   InsecureSkipVerify and F5's Insecure flag).
2. Threat model the choice accepts (passive eavesdropper on
   operator-controlled network; layered SSH-key auth limits
   blast radius).
3. Threat model the choice does NOT accept (public-internet
   ephemeral hosts, multi-tenant networks, strict MITM-
   resistance regulatory requirements).
4. Mitigations operators can layer (custom SSHClient via
   NewWithClient + golang.org/x/crypto/ssh/knownhosts; SSH
   certificate authentication via @cert-authority pinning;
   network segmentation; per-target key rotation).
5. When to NOT use the SSH connector (regulatory environments,
   dynamic IPs, multi-tenant networks).
6. V3-Pro forward path (built-in known_hosts management,
   tracked in WORKSPACE-ROADMAP.md).

Inline comment in ssh.go realSSHClient.Connect updated to
forward-reference the new doc subsection (no logic change; same
HostKeyCallback: ssh.InsecureIgnoreHostKey() call).

Same shape Bundle 8 used for "Operator playbook: keytool argv
password exposure" in docs/connectors.md JavaKeystore section.

No code-behavior changes. No test changes.

Verified locally:
- gofmt / go vet clean.
- go test -short ./internal/connector/target/ssh/...  green.

Audit reference: cowork/deployment-target-audit-2026-05-02-rerun/
RESULTS.md Top-10 fix #7.
2026-05-02 22:44:30 +00:00
shankar0123 62f0a284be iis,wincertstore: default-deadline ctx wrapper for PowerShell exec calls
Closes Top-10 fix #4 of the 2026-05-02 deployment-target audit
re-run (see cowork/deployment-target-audit-2026-05-02-rerun/
RESULTS.md). Pre-fix, both IIS and WinCertStore's realExecutor
invoked PowerShell via exec.CommandContext(ctx, ...) and relied
entirely on the caller's ctx to provide a deadline. If the caller
forgot to attach one (context.Background() in a deeply-nested
path; an operator running an ad-hoc deploy via a CLI that doesn't
default-deadline its ctx), a hung WinRM session blocked the
deploy worker thread indefinitely.

S2 (failure isolation) bar from the audit: "does a hung WinRM
take down the deploy worker pool?" — today's answer was
"potentially yes" for these two connectors. Post-fix the answer
is "no, capped at the configured ExecDeadline (default 60s)".

This commit:

1. Adds Config.ExecDeadline (time.Duration, json: "exec_deadline")
   to both connectors, defaulted to 60 seconds. WinCertStore
   defaults via the existing applyDefaults helper; IIS defaults
   inline at New() and inside ValidateConfig (the IIS connector
   has no shared applyDefaults helper today; out-of-scope to
   refactor one in for this minor fix). Operators on slow
   Windows links can override via the JSON config field
   exec_deadline.

2. Wraps realExecutor.Execute with a fallback context.WithTimeout
   that fires ONLY when ctx has no deadline of its own. Caller-
   supplied deadlines always win — the wrapper is a safety net,
   not a hard cap. defer cancel() guards against goroutine leaks.

3. Tests:
   - TestIIS_RealExecutor_AttachesDefaultDeadlineWhenCallerHasNone
     (passes context.Background; asserts the call returns within
     500ms with an error). On Linux/macOS runners powershell.exe
     is missing and exec.Cmd fails fast; on Windows the wrapper's
     ctx deadline cancels the running PowerShell process. Either
     path returns well under 500ms.
   - TestIIS_RealExecutor_RespectsCallerDeadlineWhenSet (10s
     fallback executor deadline, 50ms caller ctx; asserts caller
     deadline wins).
   - TestIIS_RealExecutor_NoDeadlineWiredWhenZero (deadline=0
     means no fallback wrapper; caller's tight ctx still bounds).
   - TestIIS_New_DefaultsExecDeadlineTo60s + TestIIS_New_RespectsExplicitExecDeadline
     pin the constructor's defaulting behavior (uses winrm mode
     so the test doesn't need powershell.exe in PATH).
   - Same five tests in wincertstore_test.go.

4. docs/connectors.md IIS + WinCertStore sections document the
   new exec_deadline field with: what it is (per-PowerShell-
   subprocess cap), default (60 seconds), override semantics
   (caller ctx deadline wins).

No change to behavior when the caller already attaches a deadline
(the common case in production code paths). Tests using the mock
executor (mockExecutor in iis_test.go / wincertstore_test.go)
are unaffected — they bypass realExecutor entirely.

S2 cross-cutting scorecard rating in
cowork/deployment-target-audit-2026-05-02-rerun/findings.json
flips from "gap" to "pass" for IIS and WinCertStore (in any
future re-audit).

Verified locally:
- gofmt / go vet / staticcheck clean across both packages.
- go test -race -count=1 ./internal/connector/target/iis/...
  ./internal/connector/target/wincertstore/...  green.

Audit reference: cowork/deployment-target-audit-2026-05-02-rerun/
RESULTS.md Top-10 fix #4.
2026-05-02 22:38:35 +00:00
shankar0123 4142837cac iis,wincertstore,javakeystore: SHA-256 idempotency short-circuit
Closes Top-10 fix #3 of the 2026-05-02 deployment-target audit
re-run (see cowork/deployment-target-audit-2026-05-02-rerun/
RESULTS.md). Pre-fix, the three PowerShell-driven connectors
(IIS / WinCertStore / JavaKeystore) bypass internal/deploy.Apply
because they write to the Windows cert store / Java keystore via
PowerShell + keytool rather than the local filesystem. They don't
get deploy.Apply's SHA-256 idempotency short-circuit for free, so
every renewal triggers a full Remove+Import cycle even on byte-
identical material. Operators with 60-day rotation see unnecessary
cert-store / keystore churn, briefly bumping CPU and possibly
disrupting connections in flight.

This commit adds a per-connector idempotency probe modeled on
Bundle 9's Caddy api-mode SHA-256 short-circuit (commit 08a86d3).
Each probe runs at the top of DeployCertificate, BEFORE the
destructive step, with a unique # CERTCTL_IDEM_PROBE PowerShell
comment tag so test mocks match deterministically.

IIS: Get-ChildItem Cert:\... + Get-WebBinding; matches when both
the cert is in the store AND the active binding's certificateHash
equals the new thumbprint.

WinCertStore: Get-ChildItem Cert:\...\<thumbprint>; matches when
the cert exists in the configured store AND its NotAfter is
still in the future.

JavaKeystore: keytool -list -alias -v; matches when the parsed
SHA-256 fingerprint equals sha256(certPEM_DER).

On match: return Success=true with Metadata["idempotent"]="true",
no destructive operation. On any error during the probe (network,
parse, etc.): fall through to today's full deploy path.
False negatives are safe; false positives are dangerous.

Tests added (one positive + one negative per connector):
- TestIIS_Idempotent_SkipsDeployWhenBindingMatches
- TestIIS_Idempotent_DifferentBinding_FallsThroughToDeploy
- TestWinCertStore_Idempotent_SkipsImportWhenCertInStore
- TestWinCertStore_Idempotent_NotInStore_FallsThroughToDeploy
- TestJKS_Idempotent_SkipsDeployWhenAliasMatches
- TestJKS_Idempotent_DifferentAlias_FallsThroughToDeploy

Verified locally:
- gofmt clean across all three connectors.
- Syntax-validated via gofmt.

Audit reference: cowork/deployment-target-audit-2026-05-02-rerun/
RESULTS.md Top-10 fix #3.
2026-05-02 22:09:30 +00:00
shankar0123 b8293653a5 postfix: add atomic-test variants for Mode=dovecot (happy path + verify-rollback)
Closes Bundle 11 of the 2026-05-02 deployment-target coverage audit
(see cowork/deployment-target-audit-2026-05-02/RESULTS.md). Pre-fix,
postfix_atomic_test.go exercised the atomic deploy path under Mode=
postfix only — the existing TestPostfix_DovecotMode at L233-246
asserted only the DeploymentID prefix, leaving applyDefaults's
dovecot-specific validate/reload command set + the rollback's
file-content-restoration unverified at the deploy-test layer.
Audit's only test-coverage gap on the otherwise-production-grade
Postfix/Dovecot connector.

This commit adds two new tests (test-only commit; no production-
code changes):

1. TestPostfix_Atomic_DovecotMode_HappyPath. Builds a Config with
   Mode: "dovecot" and NO ValidateCommand / NO ReloadCommand set.
   Calls ValidateConfig (which is what triggers applyDefaults via
   its JSON-marshal-then-parse path) before DeployCertificate.
   Captures the validate + reload commands threaded through the
   SetTestRunValidate / SetTestRunReload hooks. Asserts:
     - capturedValidateCmd contains "doveconf -n" (applyDefaults
       populated it from the dovecot branch).
     - capturedReloadCmd contains "doveadm reload".
     - DeploymentID prefix "dovecot-" + result.Metadata["mode"] is
       "dovecot" (Mode survived end-to-end).

2. TestPostfix_Atomic_DovecotMode_VerifyFails_Rollback. Pre-creates
   cert.pem AND key.pem with known "ORIG-CERT" / "ORIG-KEY" bytes.
   Builds Config with Mode: "dovecot", PostDeployVerify enabled
   (Endpoint pointing at a dovecot-IMAPS-style :993 — value unused
   by the probe stub), PostDeployVerifyAttempts: 1 (default is 3
   attempts × 2s backoff = 4+ seconds; we don't need that for a
   unit test). Probe stub returns Success: false, which
   runPostDeployVerify wraps as "TLS probe failed: ...". Asserts:
     - DeployCertificate returns error containing "TLS probe failed".
     - cert.pem AND key.pem on disk contain the ORIG bytes
       verbatim — Bundle 11's load-bearing assertion that the
       rollback restored the pre-deploy file state under
       Mode=dovecot. The existing TestPostfix_VerifyMismatch_Rollback
       (Mode=postfix) only asserts the error; this test extends to
       file-content restoration.

Existing TestPostfix_DovecotMode (L233-246) preserved as-is — the
minimal DeploymentID-prefix smoke test complements the new richer
tests without duplicating their scope.

The encoding/json import is added to support the HappyPath test's
json.Marshal call. No other dependency changes.

No production-code changes; the connector itself was already
correct for Mode=dovecot. Only the test pin was missing.

Verified locally:
- gofmt -l ./internal/connector/target/postfix/  clean
- go vet ./internal/connector/target/postfix/  clean
- go build ./cmd/agent/...  clean (no signature changes)
- go test -race -count=1 ./internal/connector/target/postfix/  green
  (24 tests total: 22 pre-existing + 2 new)

Audit reference: cowork/deployment-target-audit-2026-05-02/RESULTS.md
Bundle 11.
2026-05-02 19:34:58 +00:00
shankar0123 08a86d355d caddy: fix duration metric + file-mode PEM validate + api-mode idempotency
Closes Bundle 9 of the 2026-05-02 deployment-target coverage audit
(see cowork/deployment-target-audit-2026-05-02/RESULTS.md). Three
small independent fixes that share one connector file:

1. Duration metric (caddy.go L176). Pre-fix:
     "duration_ms": fmt.Sprintf("%d", time.Since(time.Now()).Milliseconds())
   This always returned ~0ms because time.Now() was called twice —
   the second call captured a baseline immediately before time.Since
   computed the delta. The intended baseline is `startTime` declared
   at L113 and threaded through deployViaFile correctly. Post-fix:
     "duration_ms": fmt.Sprintf("%d", time.Since(startTime).Milliseconds())
   deployViaAPI's signature evolves to take startTime time.Time so
   the api-mode path uses the same baseline as the file-mode path.

2. File-mode ValidateDeployment now validates PEM syntax. Pre-fix
   (caddy.go L266-293) checked file existence only via os.Stat. A
   cert file containing garbage bytes passed validation; Caddy's
   file-watcher silently failed to load it; operators saw "validation
   green" + "TLS handshake fails" with no obvious connection.
   Post-fix: after the os.Stat checks succeed, os.ReadFile + parse
   the first PEM block as an x509 cert via the shared
   certutil.ParseCertificatePEM helper. Failure surfaces as
   Valid=false with a clear "not valid PEM/x509" message.

3. API-mode idempotency short-circuit. Pre-fix, every deploy POSTed
   to /config/apps/tls/certificates/load even when the active cert
   was already what we wanted to deploy. Caddy reloads TLS state on
   every POST, briefly bumping CPU and possibly disrupting connections
   in flight. Post-fix: idempotencySkipPOST runs a GET first, parses
   the response (handles BOTH the array-of-objects and single-object
   shapes Caddy admin can return), SHA-256 compares the entry's
   `cert` field to the deploy payload's cert bytes, and skips the
   POST when match. Result.Metadata["idempotent"]="true" surfaces
   the no-op. Conservative: any GET failure (network, non-200, parse
   error, no matching entry, hash mismatch) silently falls through to
   the POST, preserving today's behavior. Idempotency is a fast path,
   not a correctness boundary — false negatives are safe; false
   positives are dangerous.

Tests added to caddy_test.go (6 new tests, ~290 LOC):
- TestCaddy_API_DurationMetric_NonZero (httptest server with a 10ms
  sleep in the POST handler; asserts duration_ms parses as int >= 5).
- TestCaddy_ValidateDeployment_FileMode_MalformedPEM_Rejected (writes
  garbage to cert.pem; asserts Valid=false with PEM/x509 in message).
- TestCaddy_ValidateDeployment_FileMode_ValidPEM_Accepted (writes a
  real ECDSA P-256 self-signed cert; asserts Valid=true).
- TestCaddy_API_Idempotent_SkipsPOSTWhenCertHashMatches (GET response
  contains the same cert as the deploy payload; POST counter remains
  0; metadata.idempotent=true; exactly 1 GET probe ran).
- TestCaddy_API_Idempotent_RunsPOSTWhenCertHashDiffers (GET response
  contains a DIFFERENT cert; POST counter is 1; idempotent absent).
- TestCaddy_API_Idempotent_GETFails_FallsThroughToPOST (GET returns
  500; POST still runs; deploy succeeds; idempotent absent).

Two existing tests updated to match the new contracts:
- TestCaddyConnector_DeployViaAPI_Success: mock handler now serves
  BOTH GET (returns "[]" so the comparison falls through) and POST
  (the original 200-OK path). The dispatch is a method-switch
  inside the path-match branch.
- TestCaddyConnector_ValidateDeployment_Success: the placeholder
  cert "MIIC..." used to pass the old existence-only check; post-Fix-2
  it fails the PEM-parse check. Test now uses generateTestCertAndKey
  to produce a real self-signed ECDSA P-256 cert.

generateTestCertAndKey helper added to the test file — same pattern
the javakeystore + wincertstore tests use, kept local because the
caddy package has no other test in the certutil family that would
make a shared helper cleaner.

Verified locally:
- gofmt -l ./internal/connector/target/caddy/  clean
- go vet ./internal/connector/target/caddy/  clean
- go build ./cmd/agent/...  clean (factory wiring unchanged)
- go test -race -count=1 ./internal/connector/target/caddy/  green
  (16 tests total: 11 pre-existing including the two updated +
  6 new)

Audit reference: cowork/deployment-target-audit-2026-05-02/RESULTS.md
Bundle 9.
2026-05-02 19:13:18 +00:00
shankar0123 eb390b2db4 javakeystore: pre-deploy export snapshot + on-import-failure rollback + argv-password operator note
Closes Bundle 8 of the 2026-05-02 deployment-target coverage audit
(see cowork/deployment-target-audit-2026-05-02/RESULTS.md). Pre-fix,
DeployCertificate at javakeystore.go:172-272 ran an irreversible
keytool -delete against the existing alias, then keytool
-importkeystore. If the import failed after the delete succeeded,
the keystore was missing the alias entirely — previous cert gone,
new cert never landed. docs/deployment-atomicity.md L94 promised
"keytool snapshot; rollback via keytool -delete + re-import"; the
code didn't deliver. Separately, the operator-facing keystore
password is passed via -storepass argv (a standard keytool
limitation) which is visible to ps(1) for the duration of each
subprocess; this was undocumented as an operator-playbook caveat.

This commit:

1. Pre-delete snapshot. When os.Stat(KeystorePath) succeeds,
   snapshotKeystore runs keytool -exportkeystore to
   <BackupDir>/.certctl-bak.<unix-nanos>.p12 BEFORE the existing
   -delete step. Backup path persisted in a local variable for
   the rollback path; export-step failure aborts the deploy
   entirely (no mutation has happened yet — the keystore is
   untouched). Snapshot skipped on first-time deploys (no
   keystore file = nothing to roll back to). The "alias not
   present in pre-existing keystore" case is recognised via the
   well-known keytool error string and treated as a clean
   first-time-on-existing-keystore signal — the deploy proceeds
   without a backup, and rollback (if needed) becomes the
   no-backup branch.

2. On-import-failure rollback. When keytool -importkeystore
   returns error, rollbackImport(ctx, backupPath) runs:
   - keytool -delete -alias <Alias> ... (best-effort; the failed
     import may have created a partial alias entry).
   - keytool -importkeystore from the backup PKCS#12 to restore
     the previous state.
   On rollback success, the deploy returns wrapped error noting
   "rolled back from <backup_path>". On rollback failure,
   returns operator-actionable wrapped error containing both the
   import error AND the rollback error AND the backup path so
   the operator can manually keytool -importkeystore from the
   .p12 file to recover.

3. Backup retention. Successful deploys prune older
   .certctl-bak.*.p12 files beyond Config.BackupRetention.
   Sort by ModTime newest-first; keep most recent N. Defaults:
   BackupRetention=0  → keep most recent 3 (the default).
   BackupRetention=N  → keep most recent N.
   BackupRetention=-1 → opt out of pruning entirely (operators
                        that wire their own archival/rotation).
   Pruning runs in the success path AFTER the optional reload
   command so it doesn't interfere with deploy-time signals.
   ReadDir / Remove failures are non-fatal (debug log only) —
   the deploy already succeeded.

4. Config gains BackupRetention int and BackupDir string fields.
   BackupDir defaults to filepath.Dir(KeystorePath) so backups
   land on the same filesystem as the keystore (atomic-ish
   writes, disk-full failures fail fast at snapshot time).

5. Helper extraction. snapshotKeystore + rollbackImport +
   pruneBackups + backupDir are private methods on Connector.
   Constants backupFilePrefix=".certctl-bak." and
   backupFileSuffix=".p12" centralise the naming convention so
   the snapshot writer, the rollback reader, and the retention
   pruner all agree.

6. Operator-playbook section added to docs/connectors.md
   JavaKeystore section. Documents the standard keytool
   -storepass argv exposure: ps(1)-visible for the duration
   of each subprocess. Lists mitigations:
   - Restrict shell access to the agent host.
   - Linux user namespaces / AppArmor / SystemD ProtectProc=
     invisible to deny ps-visibility.
   - Single-purpose container for proper PID-namespace
     isolation.
   - Post-deploy keystore password rotation via reload_command
     for high-security environments.
   - BCFKS keystore type for FIPS environments (same argv
     caveat applies).
   Also documents an "Atomic rollback" subsection covering the
   snapshot/rollback flow, the new backup_retention /
   backup_dir Config fields, and the design choice to reuse
   the keystore password for the snapshot (rather than
   generating a separate transient password) — operator
   already trusts the connector with this secret, surface area
   doesn't grow, rollback's matching -srcstorepass stays
   simple.

Tests added to javakeystore_test.go (7 new tests, ~430 LOC):

- TestJKS_Snapshot_RunsBefore_Delete: mock executor records call
  order; asserts -exportkeystore is call[0], -delete is call[1],
  -importkeystore is call[2]. The snapshot MUST run before the
  delete — otherwise the delete destroys the very state the
  snapshot is meant to capture.
- TestJKS_Snapshot_FirstTimeDeploy_NoExport: no keystore file
  pre-created; asserts exactly 1 keytool call (-importkeystore
  only), no -exportkeystore.
- TestJKS_ImportFails_RollsBack: happy rollback path with one
  same-Subject backup. Asserts rollback re-import references the
  same backup path the snapshot wrote (verified via arg
  comparison between call[0] and call[4]).
- TestJKS_ImportFails_RollbackAlsoFails_OperatorActionable:
  wrapped-error escalation with backup path in the error
  message.
- TestJKS_BackupRetention_PrunesOldBackups: 5 pre-existing
  staggered-ModTime backups + 1 deploy-created → retention=3 →
  exactly 3 newest survive (deploy-created + 2 newest
  pre-existing); 3 oldest pre-existing pruned.
- TestJKS_BackupRetention_Zero_DefaultsTo3: BackupRetention=0
  must default to 3 (not "keep none").
- TestJKS_BackupRetention_Negative_OptsOut: BackupRetention=-1
  pre-existing 5 + deploy 1 = 6 total, all 6 remain.
- TestJKS_Snapshot_AliasNotInKeystore_ProceedsCleanly: keystore
  exists but alias missing; -exportkeystore returns "alias does
  not exist" → snapshot helper recognises this signal and
  returns ("", nil) so the deploy proceeds cleanly.

mockExecutor extended with optional `onCall` hook so the
retention-pruning tests can simulate keytool -exportkeystore's
file-write side effect (via the simulateExportSideEffect helper
that parses -destkeystore from args and writes a placeholder
.p12 file). Existing tests that don't set onCall behave
identically to before — backward compatible.

docs/deployment-atomicity.md L94 unchanged from today's text —
Bundle 1 doc-realignment hasn't shipped, so the "keytool snapshot;
rollback via keytool -delete + re-import" line was never softened.
Post-Bundle-8 the claim is honest (was aspirational pre-fix).

Verified locally (sandbox lacks staticcheck install due to disk
pressure; CI runs the full lint gate):
- gofmt -l ./internal/connector/target/javakeystore/ clean
- go vet ./internal/connector/target/javakeystore/ clean
- go build ./cmd/agent/... clean
- go test -race -count=1 ./internal/connector/target/javakeystore/
  green (16 tests total: 9 pre-existing + 7 new)

Audit reference: cowork/deployment-target-audit-2026-05-02/RESULTS.md
Bundle 8.
2026-05-02 19:01:06 +00:00
shankar0123 60ae92b0e8 wincertstore: pre-deploy snapshot + on-import-failure rollback
Closes Bundle 7 of the 2026-05-02 deployment-target coverage audit
(see cowork/deployment-target-audit-2026-05-02/RESULTS.md). Pre-fix,
DeployCertificate at wincertstore.go:162-215 ran a single PowerShell
script that imported the PFX, optionally set FriendlyName, and
optionally removed expired same-Subject certs. Import-PfxCertificate
is atomic at the cert-store level, but the wider sequence (import →
friendly name → remove expired) is not. Failure in any post-import
step left the new cert in the store with no clean recovery path.
docs/deployment-atomicity.md L93 promised "Get-ChildItem snapshot
for rollback"; the code didn't deliver.

This commit:

1. Pre-deploy snapshot. New PowerShell script (tagged
   `# CERTCTL_SNAPSHOT`) runs Get-ChildItem over the target store,
   captures every thumbprint, and for each cert with the same
   Subject as the new one calls Export-PfxCertificate to a tempdir
   using a transient snapshotExportPassword (32-byte random,
   distinct from the import PFX password). Output parsed into a
   snapshotState{Entries: []{Thumbprint, PfxPath}, AllThumbprints,
   TempDir, ExportPassword}. The new cert's Subject is parsed from
   request.CertPEM via certutil.ParseCertificatePEM before any
   cert-store mutation; PEM-parse failure aborts the deploy
   cleanly.

2. On-import-failure rollback. When the import-script Execute
   returns error, run a rollback script (tagged
   `# CERTCTL_ROLLBACK`) that:
   - Test-Path on the new cert path; Remove-Item if present.
   - Import-PfxCertificate -FilePath <pfxPath> for each snapshot
     entry (restores prior state).
   - Remove-Item -Recurse on the snapshot tempdir.

3. Post-rollback verification. Re-read Get-ChildItem (tagged
   `# CERTCTL_VERIFY`); assert every original thumbprint is back.
   On mismatch, append a warning to the DeploymentResult message
   (rollback ran but final state is suspect — operator inspection
   recommended). Skipped when AllThumbprints is empty (first-time
   deploy).

4. Success-path tempdir cleanup. New script tagged
   `# CERTCTL_CLEANUP` runs after a successful import to remove
   the snapshot tempdir on a best-effort basis. Failure here is
   non-fatal (debug log only).

5. Helper extraction. rollbackImport(ctx, snapshot, newThumbprint)
   + verifyRollback(ctx, snapshot) + cleanupSnapshot(ctx, snapshot)
   + parseSnapshotOutput are private methods/functions on
   Connector for clean test seams. Each script emits a unique
   `# CERTCTL_*` PowerShell comment tag so test mocks can match
   scripts deterministically — the snapshot/rollback/verify/cleanup
   scripts all reference Cert:\<store> paths, so the comment tags
   are the only deterministic substring under randomized map
   iteration.

DeploymentResult shape on failure:
- import OK, rollback OK   → Success=false, "PowerShell import
                              failed; rolled back" (clean
                              recoverable failure).
- import FAIL, rollback OK → same.
- rollback FAIL            → operator-actionable wrapped error
                              containing both errors; metadata
                              flags manual_action_required=true
                              and surfaces import_error /
                              rollback_error verbatim.

Tests added to wincertstore_test.go:
- TestWinCertStore_ImportFails_RemovesNewCert_RestoresOldFromSnapshot
  — happy rollback path with one same-Subject cert in the
  snapshot. Asserts rollback script contains Remove-Item for the
  new thumbprint AND Import-PfxCertificate referencing the
  snapshotted PFX path.
- TestWinCertStore_ImportFails_NoExistingSameSubject_RemovesNewCertOnly
  — snapshot has THUMB: lines but no SNAPSHOT: entries; rollback
  removes the new cert but does NOT call Import-PfxCertificate.
- TestWinCertStore_FriendlyNameFails_NewCertRemoved_OldCertsRestored
  — variant where the import script's failure originates from
  Set-ItemProperty FriendlyName; same rollback path. Asserts
  metadata.import_error preserves the FriendlyName-related
  PowerShell output for operator visibility.
- TestWinCertStore_ImportFails_RollbackAlsoFails_OperatorActionable
  — wrapped-error escalation. Asserts the error mentions both
  "PowerShell import failed" and "rollback also failed", and
  metadata flags manual_action_required=true.

Three existing tests (Success, ImportFailed, WithFriendlyName,
WithRemoveExpired) updated to match the new contract: success
path runs 3 PowerShell scripts (snapshot + import + cleanup),
import-failure path runs 4 (snapshot + import + rollback + verify),
and the import script lives at mock.scripts[1] not [0].

PowerShell injection note: the new cert's Subject DN is embedded
in the snapshot script as a single-quoted literal. Subject DNs can
contain apostrophes (e.g. CN=O'Reilly), so escapePowerShellSingleQuoted
doubles them per the PowerShell single-quoted-literal escape rule.
The export password and thumbprints come from
certutil.GenerateRandomPassword (alphanumeric only) and the cert's
SHA-1 thumbprint hex (alphanumeric); no escaping needed for those.

docs/deployment-atomicity.md L93 unchanged from today's text —
Bundle 1 doc-realignment hasn't shipped, so the "Get-ChildItem
snapshot for rollback" line was never softened. Post-Bundle-7 the
claim is honest (was aspirational pre-fix).

Verified locally (sandbox lacks staticcheck install due to disk
pressure; CI runs the full lint gate):
- gofmt -l ./internal/connector/target/wincertstore/  clean
- go vet ./internal/connector/target/wincertstore/  clean
- go build ./cmd/agent/...  clean
- go test -race -count=1 ./internal/connector/target/wincertstore/
  green

Audit reference: cowork/deployment-target-audit-2026-05-02/RESULTS.md
Bundle 7.
2026-05-02 18:13:40 +00:00
shankar0123 c222c8b57a ssh: fix staticcheck ST1008 — error is last return from restoreFromBackups
CI's golangci-lint run on commit 636de7f ("ssh: pre-deploy snapshot
+ reload-failure rollback") caught a staticcheck ST1008 violation:
restoreFromBackups returned (error, map[string]string) — error must
be the last return value per Go convention.

Reorder the return tuple to (map[string]string, error) and update
the single caller in DeployCertificate. No behavior change; pure
signature shuffle to satisfy the lint gate.

Verified locally:
- gofmt -l ./internal/connector/target/ssh/  clean
- go vet ./internal/connector/target/ssh/  clean
- go test -race -count=1 ./internal/connector/target/ssh/  green
2026-05-02 17:35:45 +00:00
shankar0123 636de7f6b5 ssh: pre-deploy snapshot + reload-failure rollback
Closes Bundle 6 of the 2026-05-02 deployment-target coverage audit
(see cowork/deployment-target-audit-2026-05-02/RESULTS.md). Pre-fix,
DeployCertificate at ssh.go:201-316 wrote new cert/key/chain via
SFTP then ran the operator's reload command. If reload failed, the
new files stayed on the remote — partial-success state with no
rollback path. docs/deployment-atomicity.md L92 promised "Pre-deploy
SCP backup of remote files"; the code didn't deliver.

This commit:

1. Pre-deploy snapshot. Before any WriteFile, iterate the deploy's
   target paths (cert, key, optional chain). For each path:
   - StatFile to detect existence. errors.Is(err, os.ErrNotExist)
     means first-time deploy (rollback = Remove). Other stat
     errors bail out before any write happens.
   - ReadFile into an in-memory backups map[string][]byte keyed
     by remote path. Original mode captured into a parallel
     modes map for restore fidelity.

2. SSHClient interface evolution — three changes:
   - StatFile(path) (os.FileInfo, error) — was (int64, error).
     FileInfo carries Mode() needed for accurate restore. Existing
     fixture tests updated to call info.Size() instead of the
     bare size value.
   - ReadFile(path) ([]byte, error) — new method; SFTP Open + read
     via io.ReadAll. realSSHClient implements via sftpClient.Open.
   - Remove(path) error — new method; SFTP Remove. Used by the
     rollback path to clean up first-time-deploy partial state.

3. On-reload-failure rollback. Replace the bare error-return at
   L282-295 with restoreFromBackups + retry-reload escalation:
   - For paths in the snapshot map, WriteFile the original bytes
     with the original mode (0600 fallback if mode capture was
     incomplete).
   - For paths that didn't exist pre-deploy, Remove the new file.
   - Re-run the reload command (best-effort second attempt). If
     it succeeds, the target is back to pre-deploy state. If it
     fails, the remote is in pre-deploy file state but the daemon
     may be stuck — surface as wrapped error so the operator
     knows where to look.

4. DeploymentResult.Metadata gains backup_status_{cert,key,chain}
   so operators can see per-path snapshot state on both success
   ("snapshotted" / "no_pre_existing" / "n/a") and failure
   ("restored" / "removed" / "restore_failed" / "remove_failed").
   buildMetadataWithBackup helper centralises the metadata
   shape so success and failure paths emit a consistent set
   of keys.

5. Helper extraction. restoreFromBackups(ctx, paths, backups,
   modes) is a private method on Connector; returns the first
   error + per-key restore status map for clean test seams.

DeploymentResult shape on failure:
- rollback OK + retry-reload OK → Success=false, "reload command
  failed; rolled back to pre-deploy state" (clean recoverable
  failure; remote fully restored, daemon serving original cert).
- rollback OK + retry-reload FAIL → wrapped error noting "rolled
  back files; retry-reload also failed; daemon may need manual
  restart". Metadata flags daemon_state_unknown=true.
- rollback FAIL → operator-actionable wrapped error containing
  BOTH the reload error AND the rollback error; metadata flags
  manual_action_required=true.

Tests added to ssh_test.go (4 new tests, ~330 LOC):
- TestSSH_ReloadFails_FilesRestored — happy rollback path with
  pre-existing remote bytes for cert/key/chain. Asserts every
  path's last WriteFile call contains the captured backup bytes
  verbatim, no Remove calls fired (all paths had snapshots), and
  metadata reports backup_status=restored for each path.
- TestSSH_NoExistingCert_ReloadFails_NewCertRemoved — first-time
  deploy variant. StatFile returns os.ErrNotExist for every path;
  rollback Removes each written file but performs no WriteFile
  during restore (no backup to restore from). Asserts exactly 3
  WriteFile calls (deploy only) and 3 Remove calls (rollback).
- TestSSH_ReloadFails_RollbackAlsoFails_OperatorActionable —
  uses a writeOrderTrackingMock to fail the SECOND WriteFile to
  the cert path (i.e. the restore call, not the initial deploy).
  Asserts wrapped error contains both the reload error and the
  rollback error, and metadata flags manual_action_required=true.
- TestSSH_ReloadFails_RestoreThenSecondReloadFails — partial-
  recovery escalation. Rollback succeeds but the post-restore
  retry-reload fails. Asserts wrapped error mentions "rolled back
  files; retry-reload also failed" and metadata flags
  daemon_state_unknown=true.

Existing tests preserved by extending mockSSHClient with backward-
compatible per-path response maps (statByPath / readByPath /
writeFileErrByPath / executeErrSequence). Legacy global fields
(statFileSize / statFileErr / writeFileErr / executeErr) still
work when no per-path override matches, so TestValidateConfig_*
and TestDeployCertificate_Success_* don't need changes.

docs/deployment-atomicity.md L92 unchanged from today's text —
Bundle 1 doc-realignment hasn't shipped, so the "Pre-deploy SCP
backup of remote files" line was never softened. Post-Bundle-6
the claim is honest (was aspirational pre-fix).

Verified locally (sandbox lacks staticcheck install due to disk
pressure; CI runs the full lint gate):
- gofmt -l ./internal/connector/target/ssh/  clean
- go vet ./internal/connector/target/ssh/  clean
- go build ./internal/connector/target/ssh/...  clean
- go build ./cmd/agent/...  clean
- go test -race -count=1 ./internal/connector/target/ssh/  green

Audit reference: cowork/deployment-target-audit-2026-05-02/RESULTS.md
Bundle 6.
2026-05-02 17:13:38 +00:00
shankar0123 30daadbe81 iis: pre-deploy binding snapshot + on-failure rollback
Closes Bundle 5 of the 2026-05-02 deployment-target coverage audit
(see cowork/deployment-target-audit-2026-05-02/RESULTS.md). Pre-fix,
DeployCertificate at iis.go:235-436 imported the cert via
Import-PfxCertificate (atomic at cert-store level) then ran a
separate PowerShell script for the SNI binding update. If the
binding script failed, the new cert was orphaned in the store AND
the old binding stayed pointed at the old thumbprint.
docs/deployment-atomicity.md L91 promised "explicit pre-deploy
backup + post-rollback re-import"; the code didn't deliver.

This commit:

1. Pre-deploy snapshot. snapshotOldBinding runs Get-WebBinding
   before the import; parses the bound SSL thumbprint into a local
   `oldThumbprint` variable. Empty = first-time binding (no
   rollback target).

2. On-failure rollback script. When the binding-update Execute
   returns error, rollbackBinding runs a single PowerShell script
   that:
   - Remove-Item Cert:\LocalMachine\<store>\<newThumbprint> (delete
     the cert we just imported but couldn't bind).
   - If oldThumbprint != "", AddSslCertificate('<oldThumbprint>',
     ...) to re-bind the old cert. Falls through to New-WebBinding
     + AddSslCertificate when the old binding entry is also gone.

3. Post-rollback verification. verifyRollback re-reads
   Get-WebBinding; asserts the bound thumbprint matches
   oldThumbprint. On mismatch, warn in the DeploymentResult
   message — the rollback ran but final state is suspect, operator
   inspection required. Skipped when oldThumbprint == "" (no
   binding to verify against).

4. Helper extraction. snapshotOldBinding / rollbackBinding /
   verifyRollback are private methods on Connector for clean test
   seams. Each emits a unique `# CERTCTL_*` PowerShell comment tag
   so test mocks can match scripts deterministically — multiple
   scripts call Get-WebBinding so substring matching otherwise
   collides under Go's randomized map iteration order.

DeploymentResult shape on failure:
- rollback OK   → Success=false, Message="binding update failed;
                  rolled back", clean error.
- rollback FAIL → Success=false, wrapped error containing both
                  binding error and rollback error; metadata
                  flags manual_action_required=true and surfaces
                  rollback_error / binding_error verbatim.

Tests added to iis_test.go:
- TestIIS_BindingUpdateFails_RemovesNewCert_RebindsOld — happy
  rollback path. Mock executor queued with snapshot →
  OLD_THUMBPRINT:abc123, import OK, binding fails, rollback →
  REBOUND_EXISTING. Asserts rollback script contains both
  Remove-Item for the new thumbprint AND
  AddSslCertificate('abc123', ...).
- TestIIS_BindingUpdateFails_NoOldBinding_RemovesNewCertOnly —
  first-time deploy variant. Snapshot returns NO_OLD_BINDING;
  rollback removes the new cert but does NOT call
  AddSslCertificate; verify script never runs.
- TestIIS_BindingUpdateFails_RollbackAlsoFails_OperatorActionable
  — wrapped-error escalation. Asserts the returned error mentions
  both `binding update failed` and `rollback also failed`, and
  metadata flags manual_action_required=true.

Two existing tests (TestIISConnector_DeployCertificate_Success and
…_SNIEnabled) updated to expect 3 commands (snapshot, import,
binding) and to look for the binding script at commands[2].

docs/deployment-atomicity.md L91 unchanged from today's text — the
"Already explicit pre-deploy backup + post-rollback re-import"
claim is now honest. (Bundle 1 doc-realignment hasn't shipped yet,
so there's no softened-pending claim to restore.)

Verified locally (sandbox lacks staticcheck install due to disk
pressure, ran via go vet + go test -race; CI runs the full lint
gate):
- gofmt -l ./internal/connector/target/iis/  clean
- go vet ./internal/connector/target/iis/...  clean
- go build ./internal/connector/target/iis/...  clean
- go test -race -count=1 ./internal/connector/target/iis/  green

Audit reference: cowork/deployment-target-audit-2026-05-02/RESULTS.md
Bundle 5.
2026-05-02 16:58:01 +00:00
shankar0123 b767f579ef traefik: refactor to single deploy.Apply Plan (all-files atomicity + rollback)
Closes Bundle 4 of the 2026-05-02 deployment-target coverage audit
(see cowork/deployment-target-audit-2026-05-02/RESULTS.md). Pre-fix,
DeployCertificate called deploy.AtomicWriteFile twice — once for
cert at L123, once for key at L131 — instead of bundling both into
a single deploy.Plan and calling deploy.Apply. Three downstream
hazards:

1. If cert write succeeds and key write fails, the cert is already
   on disk. The in-line best-effort cert rollback at L137-141 had
   no error wrapping and the dedicated rollbackCertAndKey helper
   only restored the cert.

2. Idempotency was per-file, not all-files. The verify gate
   (if !certRes.Idempotent) skipped verify when cert was unchanged
   but key was new — exactly the shape that produces a fresh key on
   disk + a stale fingerprint served, and zero alarm.

3. Verify-failure rollback only handled the cert. Key was left in
   whatever state the deploy reached.

This commit aligns Traefik with the canonical NGINX/Apache/HAProxy/
Postfix template:

- buildPlan() constructs deploy.Plan{Files: []{cert, key}}.
- deploy.Apply runs it all-or-nothing. SHA-256 idempotency is
  all-files (Result.SkippedAsIdempotent).
- No PreCommit (Traefik has no validate-with-target command —
  file watcher absorbs config errors).
- No PostCommit (file watcher auto-reloads on rename).
- runPostDeployVerify retained as-is (TLS handshake + SHA-256
  fingerprint compare + retry/backoff).
- On verify failure, restoreFromBackups iterates
  res.BackupPaths and rewrites each destination via
  AtomicWriteFile{SkipIdempotent: true, BackupRetention: -1}.

Removed:
- The legacy rollbackCertAndKey helper (cert-only restore).
- The inline best-effort cert-rollback in DeployCertificate.

Tests added to traefik_atomic_test.go:
- TestTraefik_Atomic_KeyWriteFails_CertRollsBack — regression guard
  for the original two-AtomicWriteFile bug. Pre-writes a sentinel
  cert; sets the key path inside a read-only subdir so the key
  write must fail; asserts the cert on disk still contains the
  sentinel bytes (Apply's all-or-nothing rollback).

- TestTraefik_Atomic_AllFilesIdempotent — two subtests:
    both_match_skips: pre-writes cert + key matching what Traefik
      would write; asserts idempotent=true AND probe is never
      called.
    cert_match_key_new_runs_verify: pre-writes only the cert; key
      is new; asserts idempotent=false AND probe IS called once.
      Pre-fix per-file gate would have leaked through and skipped
      the verify here.

- TestTraefik_Atomic_VerifyMismatch_BothFilesRollBack — pre-writes
  sentinel cert + key; stub probe returns wrong fingerprint;
  asserts BOTH files are restored to sentinel bytes after the
  rollback fires. Pre-fix rollbackCertAndKey only restored the
  cert; the key would still be the new bytes.

The pre-existing TestTraefik_Atomic_VerifyMismatch_Rollback (which
asserted only the cert restore) is left intact — it's a strict
subset of the new BothFilesRollBack assertion and serves as a
narrower regression guard.

docs/deployment-atomicity.md L84 unchanged — operator-facing claim
("atomic-write only; ValidateOnly returns sentinel") stays accurate.

Verified locally:
- gofmt -l ./internal/connector/target/traefik/ clean
- go vet ./... clean
- staticcheck ./internal/connector/target/traefik/... clean
- go build ./... clean
- go test -race -count=1 ./internal/connector/target/traefik/...
  green (pre-existing tests + 3 new = 13 test functions; 14 with
  the AllFilesIdempotent subtests)
- go test -short -count=1 ./internal/connector/target/... green
  (no cross-connector regressions)

Audit reference: cowork/deployment-target-audit-2026-05-02/RESULTS.md
Bundle 4.
2026-05-02 16:16:25 +00:00
shankar0123 febf50090b envoy: atomic SDS JSON write + post-deploy watcher pickup poll
Closes Bundle 3 of the 2026-05-02 deployment-target coverage audit
(see cowork/deployment-target-audit-2026-05-02/RESULTS.md). The audit
ranked this fix #3 by acquirer impact behind the K8s real client (#1)
and the docs realignment (#2 / Bundle 1).

Two production-grade gaps closed:

1. SDS JSON config write was non-atomic. Cert/key/chain at envoy.go
   L155/L168/L183 went through deploy.AtomicWriteFile (atomic + backups
   + ownership preservation), but the SDS JSON at L260 went through
   os.WriteFile directly. A power loss / OOM / process-kill mid-write
   of the SDS JSON produces a torn file Envoy cannot parse, and
   Envoy's file-based SDS watcher refuses to load any cert (not just
   the rotating one) until the JSON is repaired by hand. Replaced with
   deploy.AtomicWriteFile and threaded ctx through writeSDSConfig.

2. No watcher pickup confirmation before returning success. Pre-fix,
   DeployCertificate returned the moment file writes completed.
   Envoy's SDS watcher is asynchronous; a caller running post-deploy
   TLS verify immediately after DeployCertificate could see Envoy
   still serving the old cert (watcher latency, load-balanced replica
   hit one that hadn't reloaded yet). Added the canonical post-deploy
   verify pattern (mirrors nginx.go::runPostDeployVerify L416): probe
   seam + retry/backoff + SHA-256 fingerprint compare against
   request.CertPEM. On verify failure, restore from per-file backups
   via the new restoreFromBackups helper. Envoy has no PostCommit
   reload to re-run; the watcher auto-reloads on the restored files.

Config additions to envoy.Config (mirror nginx.Config L84-93):
- PostDeployVerify *PostDeployVerifyConfig (Enabled, Endpoint, Timeout)
- PostDeployVerifyAttempts int (default 3 in runPostDeployVerify)
- PostDeployVerifyBackoff time.Duration (default 2s)
- BackupRetention int (mirrors nginx; passed to AtomicWriteFile per file)

Default behaviour unchanged for callers that don't set
PostDeployVerify — verify is opt-in. nil or Enabled=false skips it
entirely.

Probe seam: c.probe = tlsprobe.ProbeTLS at construction; tests inject
via the new SetTestProbe method. Same shape NGINX uses (nginx.go:130);
also mirrors the existing Traefik SetTestProbe at traefik.go:62.

WriteResult retention: every AtomicWriteFile call now retains its
*deploy.WriteResult in a local []*deploy.WriteResult slice so the
rollback path can restore from BackupPath across all four files
(cert, key, chain, SDS JSON), not just the cert. Pre-fix the cert's
WriteResult was discarded.

restoreFromBackups (envoy.go new): iterates the WriteResults from a
successful per-file pass, rewrites each non-idempotent destination
from its BackupPath via AtomicWriteFile{SkipIdempotent:true,
BackupRetention:-1}. The -1 prevents backup-of-the-backup pollution.
For files that didn't exist pre-deploy (BackupPath == ""), restore =
remove. Mirrors nginx.go::rollbackToBackups (L487-515) with the
reload step elided.

Idempotency gate: shouldRunVerify returns true unless EVERY
WriteResult was Idempotent — same all-files semantics NGINX gets
from res.SkippedAsIdempotent. Pre-fix Envoy had no verify at all,
so there was no gate to get wrong; this introduces the correct
all-files shape from the start.

Tests added to envoy_atomic_test.go:
- TestEnvoy_Atomic_SDSConfigWriteIsAtomic — pre-writes a sentinel
  SDS JSON, runs DeployCertificate, asserts a backup file with
  deploy.BackupSuffix appears alongside the new sds.json (proves
  AtomicWriteFile is now in the SDS path).
- TestEnvoy_Atomic_WatcherPickupRetries — stub probe returns wrong
  fingerprint on attempts 1+2 and correct on attempt 3; deploy
  succeeds; probe called exactly 3 times.
- TestEnvoy_Atomic_WatcherPickupAllAttemptsFail_RollsBack — pre-writes
  SENTINEL bytes for cert+key, stub probe always wrong; deploy
  returns wrapped error AND the destination files contain the
  sentinel bytes (rollback restored).
- TestEnvoy_Atomic_PostDeployVerifyDisabledByDefault — Config with
  nil PostDeployVerify; asserts probe is never called (opt-in
  default preserved).

A small certPEMFingerprint helper added to the test file mirrors the
production envoy.certPEMToFingerprint (which is package-private —
external tests can't call it).

docs/deployment-atomicity.md L87 row already documents
"TLS handshake | atomic-write replaces os.WriteFile" — pre-fix the
claim was aspirational (verify happened in the agent verify-and-report
path, not the connector; SDS JSON wasn't atomic). Post-fix the claim
is honest. No doc change required.

Verified locally:
- gofmt -l ./internal/connector/target/envoy/ clean
- go vet ./internal/connector/target/envoy/... clean
- staticcheck ./internal/connector/target/envoy/... clean
- go build ./... clean
- go test -race -count=1 ./internal/connector/target/envoy/... green
  (5 pre-existing tests + 4 new = 9 total)
- go test -short -count=1 ./internal/connector/target/... green

Audit reference: cowork/deployment-target-audit-2026-05-02/RESULTS.md
Bundle 3.
2026-05-02 16:08:20 +00:00
shankar0123 475421457f fix(test): TestBoundedFanOut_SkipsAgentRoutedDeployments race on seenIDs slice
CI race detector flagged TestBoundedFanOut_SkipsAgentRoutedDeployments
on commit 35e18bf (audit fix #9). The test's `work` closure was
appending to a plain []string slice from worker goroutines without
synchronisation:

    var seenIDs []string
    work := func(ctx context.Context, job *domain.Job) error {
        seen.Add(1)
        seenIDs = append(seenIDs, job.ID)  // race
        return nil
    }

atomic.Int64 covered the count assertion but the slice header itself
is the racing memory — race detector caught both the read+write race
on the slice header and the runtime.growslice path on append.

Fix: protect seenIDs with a sync.Mutex. The slice is only used in
the failure-message branch (`t.Errorf` ids=%v formatting), so the
contention is irrelevant to performance — correctness only.

Also locked around the read in the t.Errorf format-args evaluation,
since that read happens AFTER boundedFanOut returns (and Wait()
inside boundedFanOut synchronizes the worker goroutines), but the
explicit Lock/Unlock makes the synchronisation visible without
depending on the implicit happens-before from Wait.

The other five tests in the file (TestBoundedFanOut_CapHolds,
_AllJobsRun, _CtxCancelInterrupts, _FailedJobsCounted,
TestSetRenewalConcurrency_NormalizesNonPositive) only mutate
atomic.Int64 counters from worker goroutines, so they were
already race-clean.

Verified locally: go test -race -count=1 -run
'TestBoundedFanOut|TestSetRenewalConcurrency' ./internal/service/...
green.
2026-05-02 14:34:48 +00:00
shankar0123 a22a1be962 globalsign,entrust: cache mTLS keypair with mtime-based reload
Closes the #10 acquisition-readiness blocker from the 2026-05-01 issuer
coverage audit. Pre-fix, GlobalSign reloaded the mTLS cert/key from
disk on every API call (globalsign.go::getHTTPClient) and Entrust
loaded once in ValidateConfig with no rotation handling — both shapes
were broken for different reasons. Per-call disk reads under a 100-
cert renewal sweep meant 200 file opens / parses / tls.X509KeyPair
calls in flight, each adding 5–50ms of latency for nothing; the
single-load Entrust shape served stale credentials forever after a
cert rotation, requiring a process restart.

This commit:

- Adds a new shared package internal/connector/issuer/mtlscache/
  with a Cache type holding a parsed tls.Certificate plus a
  precomputed *http.Transport. RWMutex serialises reloads; reads
  are lock-free in the hot path (read lock briefly held to copy
  out the *http.Client pointer, then released — the HTTP request
  itself happens with no lock held, per the audit prompt's anti-
  pattern about holding the write lock across an API call).

- RefreshIfStale stats the cert file; if mtime advanced beyond
  the last load, the keypair is re-parsed and the transport is
  rebuilt. The fast path (mtime unchanged) takes the read lock
  for the comparison and returns immediately. Double-checked-lock
  pattern (read lock → stat → release → write lock → re-stat)
  prevents two callers who observed the same stale mtime from
  both reloading.

- Options.TLSConfigBuilder lets the caller customise the *tls.Config
  built around the parsed leaf certificate. GlobalSign uses this
  to inject the ServerCAPath-pinning RootCAs pool that
  buildServerTLSConfig already produces; entrust uses the default
  builder.

- New() performs the initial load so a broken cert path fails
  fast at construction rather than at first API call.

- GlobalSign.Connector gains an mtls field. getHTTPClient now:
  (1) preserves the test-mode short-circuit when httpClient has
      a non-nil Transport;
  (2) preserves the bare-default-client short-circuit when cert
      paths aren't configured;
  (3) lazy-builds the cache on the first call so the constructor
      stays cheap;
  (4) calls RefreshIfStale on every subsequent call.
  The error wrap preserves the substring "client certificate" so
  existing TestGlobalsign_GetHTTPClient_MTLSPathConfigured_LoadsKeyPair
  keeps its assertion.

- Entrust.Connector gains an mtls field plus a new getHTTPClient
  helper mirroring GlobalSign's shape. The three IssueCertificate /
  RevokeCertificate / pollEnrollmentOnce sites that previously hit
  c.httpClient.Do(req) directly now route through getHTTPClient,
  which falls through to the test-injected client (same logic as
  GlobalSign) and otherwise serves the cached mTLS client. The
  legacy ValidateConfig flow that pre-built c.httpClient with its
  own transport stays intact — its transport wins because
  getHTTPClient short-circuits when c.httpClient.Transport != nil.

- Tests at internal/connector/issuer/mtlscache/cache_test.go cover:
  * fail-fast on missing paths (constructor input validation)
  * load on construction (positive + negative)
  * NoReloadWhenMtimeStable — 100 RefreshIfStale calls, LoadedAt
    must stay equal to the constructor's stamp (the load-bearing
    regression guard against per-call disk reads)
  * ReloadsOnMtimeAdvance — os.Chtimes forward, next refresh
    must observe the new LoadedAt (the load-bearing regression
    guard for rotation-without-process-restart)
  * StatErrorBubbles — missing cert file surfaces as an error
    rather than silently serving stale credentials
  * ConcurrentNoRace — 100 goroutines × 50 iterations under
    -race; no race detected, all calls succeed
  * TLSConfigBuilderUsed — custom builder is invoked at New AND
    on reload; verifies MinVersion=TLS1.3 takes effect
  * ClientHonoursTimeout — Options.HTTPTimeout reaches the
    constructed *http.Client

- docs/connectors.md GlobalSign + Entrust sections each gain an
  "mTLS keypair caching (audit fix #10)" paragraph documenting the
  steady-state caching, mtime-based rotation contract, and
  operator workflow (mv -f new.crt /etc/certctl/.../client.crt).

Acquirer impact: removes the per-call disk-read latency floor and
makes operator-driven cert rotation a no-restart event. Combined
with audit fix #9's bounded scheduler concurrency, the renewal
sweep's hot path now has predictable steady-state cost: capN
concurrent goroutines, each reusing the cached keypair, no per-
call file I/O.

Verified locally:
- gofmt -l . clean
- go vet ./... clean
- staticcheck ./... clean
- go test -race -count=1 ./internal/connector/issuer/mtlscache/...
  green (8 tests)
- go test -count=1 -short across globalsign / entrust / sectigo /
  ejbca / mtlscache / connector packages: green

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #10. Closes the audit's full Top-10 list (fixes #1-10
all shipped to master).
2026-05-02 14:32:59 +00:00
shankar0123 35e18bfc56 scheduler: bound renewal concurrency via CERTCTL_RENEWAL_CONCURRENCY
Closes the #9 acquisition-readiness blocker from the 2026-05-01 issuer
coverage audit. Pre-fix, JobService.ProcessPendingJobs ran every
claimed job sequentially in a single goroutine: safe but slow, and
operators with large fleets had no lever to dial throughput up.
Switching to fire-and-forget per-job goroutines would have unbounded
the upstream-CA call rate and tripped DigiCert / Entrust / Sectigo
rate limits — certctl's response to 429 was to retry on the next
tick, re-fanning out the same calls and digging deeper into the
limit. Operators need a knob.

This commit:

- Adds CERTCTL_RENEWAL_CONCURRENCY env var (default 25) loaded via
  the existing getEnvInt pattern in internal/config/config.go.
  Documented inline as the cap for the per-tick renewal/issuance/
  deployment goroutine fan-out, with operator-tuning guidance:
  permissive upstream limits + large fleets (>10k certs) → 100;
  strict limits or async-CA-heavy fleets → 25 or lower.

- Wires golang.org/x/sync/semaphore.Weighted around the per-job
  goroutine launch in JobService.ProcessPendingJobs. Acquire(ctx, 1)
  is the load-bearing piece — it BLOCKS the loop when at the cap,
  providing real backpressure rather than fire-and-forget. The
  fan-out is split into processPendingJobsSequential (legacy,
  preserved for unit-test wiring that doesn't call
  SetRenewalConcurrency) and processPendingJobsConcurrent (production,
  delegates to a generic boundedFanOut helper).

- boundedFanOut takes the per-job work as a closure so the cap can
  be tested directly without standing up the renewal/deployment
  service graph. processed/failed counters use atomic.Int64 to
  avoid mutex overhead on every job completion; final log line
  reads both AFTER wg.Wait so the counts reflect every dispatched
  job. ctx-aware Acquire ensures a shutdown ctx cancel interrupts
  the dispatch loop promptly; in-flight goroutines drain via Wait
  before the function returns so no goroutine outlives the
  scheduler tick.

- shouldSkipJob extracted as a package-private helper so the
  agent-routed-deployment skip logic is shared between the
  sequential and concurrent paths byte-for-byte (the audit prompt's
  "channel-based semaphore without ctx-aware acquire" anti-pattern
  is explicitly avoided — semaphore.Weighted.Acquire returns on ctx
  done; channel <- struct{}{} would block forever).

- SetRenewalConcurrency setter on JobService normalises ≤0 to 1.
  semaphore.NewWeighted(0) constructs a semaphore that blocks every
  Acquire forever; the normalisation prevents a misconfigured env
  var from wedging the scheduler.

- cmd/server/main.go wires SetRenewalConcurrency(cfg.Scheduler.
  RenewalConcurrency) on the freshly-built jobService, immediately
  after SetAuditService. Production deployments always take the
  bounded path; tests that build JobService directly via
  NewJobService keep their strict-sequential behaviour because
  renewalConcurrency is the zero value.

- Tests in internal/service/job_concurrency_test.go:
  * TestBoundedFanOut_CapHolds — primary regression guard. 50 jobs
    × 50ms work × cap=5 → asserts peak in-flight never exceeds 5
    AND reaches 5 at least once (catches both upper-bound regressions
    and gates that incorrectly cap below the configured value).
    Lock-free max via CompareAndSwap so the measurement instrument
    doesn't itself constrain concurrency.
  * TestBoundedFanOut_AllJobsRun — lower-bound: every non-skipped
    job is dispatched.
  * TestBoundedFanOut_SkipsAgentRoutedDeployments — pins the
    shouldSkipJob contract.
  * TestBoundedFanOut_CtxCancelInterrupts — ctx cancellation
    interrupts a stuck fan-out within the timeout budget.
  * TestBoundedFanOut_FailedJobsCounted — per-job errors don't
    abort the fan-out.
  * TestSetRenewalConcurrency_NormalizesNonPositive — ≤0 → 1 fail-safe
    pinned across negative/zero/positive inputs.

- docs/features.md: scheduler-loop table augmented with the
  concurrency-cap env-var pointer alongside the job-processor row.

- docs/architecture.md: Concurrency Safety section gains a paragraph
  explaining the cap, the operator-tuning guidance, the ctx-aware
  Acquire semantics, and the audit reference.

Operator-facing impact: the first big renewal sweep no longer
takes down the upstream CA's rate-limit budget. Existing deployments
get the bounded path automatically (default 25); operators can
override via env var without code changes.

Verified locally:
- gofmt -l . clean
- go vet ./... clean
- staticcheck ./... clean
- go test -short -count=1 across service / scheduler / config /
  integration: green
- Six new tests under TestBoundedFanOut* + TestSetRenewalConcurrency*:
  green

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #9.
2026-05-02 14:12:30 +00:00
shankar0123 fefa5a5fd7 acme: support serial-only revocation via local cert-version lookup
Closes the #7 acquisition-readiness blocker from the 2026-05-01 issuer
coverage audit. Pre-fix, ACME RevokeCertificate at acme.go:L519-L529
returned the literal error "ACME revocation by serial not supported in
V1; provide certificate DER". RFC 8555 §7.6 genuinely requires the
cert DER bytes (not just the serial), but a CLM platform's job is to
abstract over that limitation. Operators routinely have only the
serial in hand: lost PEM, rotated key, GUI revoke action driven by a
row in the certs list.

This commit:

- Adds CertificateLookupRepo interface at the ACME connector boundary
  (connector boundary, NOT a service/repository import — the connector
  accepts whatever satisfies the shape). Production wiring in
  cmd/server/main.go injects the postgres CertificateRepository; tests
  inject a fake.

- Adds CertificateRepository.GetVersionBySerial(ctx, issuerID, serial)
  + interface declaration in repository/interfaces.go, returning the
  certificate_versions row whose SerialNumber matches, scoped to the
  issuer via JOIN on managed_certificates. Mirrors the existing
  GetByIssuerAndSerial shape but returns the version (where PEMChain
  lives). Per RFC 5280 §5.2.3 the issuer scope is required for
  determinism.

- Adds SetCertificateLookup + SetIssuerID setters on *acme.Connector.
  Mirror the pattern local.Connector already uses for OCSP responder
  wiring. Both must be wired before serial-only revoke works;
  unwired state falls back to a more actionable error pointing at the
  wiring requirement (the historical "not supported" wording is
  retired).

- Rewrites RevokeCertificate end-to-end: lookup → empty-PEM check →
  pem.Decode → block.Type == "CERTIFICATE" check → ensureClient →
  golang.org/x/crypto/acme.Client.RevokeCert(ctx, accountKey, der,
  reasonCode). RFC 8555 §7.6 case 1 (revocation request signed with
  account key) — the same account key issued the cert, so authority
  is intrinsic. The not-found path returns an actionable operator-
  facing error pointing at the local-store requirement.

- Adds mapRevocationReason translating RFC 5280 §5.3.1 reason strings
  (unspecified, keyCompromise, cACompromise, affiliationChanged,
  superseded, cessationOfOperation, certificateHold, removeFromCRL,
  privilegeWithdrawn, aACompromise) into golang.org/x/crypto/acme.
  CRLReasonCode. Accepts canonical camelCase + underscore_lower +
  ALL_CAPS_UNDERSCORE. Nil reason → 0 (unspecified). Unknown reason
  errors rather than silently demoting (operators rely on the reason
  for compliance reporting).

- Wiring update in service/issuer_registry.go: SetACMECertLookup
  setter on the registry; Rebuild type-asserts *acme.Connector and
  calls SetCertificateLookup + SetIssuerID, mirroring the existing
  *local.Connector branch. cmd/server/main.go calls
  issuerRegistry.SetACMECertLookup(certificateRepo) immediately after
  SetIssuanceMetrics — the postgres repo satisfies the interface via
  GetVersionBySerial.

- Tests:
  * acme_revoke_test.go (new): TestRevokeCertificate_NoCertLookupWired,
    TestRevokeCertificate_NoIssuerIDWired,
    TestRevokeCertificate_LookupReturnsNotFound (operator-facing
    "may not have been issued through certctl" hint pinned),
    TestRevokeCertificate_LookupArbitraryError,
    TestRevokeCertificate_VersionPEMEmpty (corrupt-row guard),
    TestRevokeCertificate_PEMMalformed_NoBlock,
    TestRevokeCertificate_PEMMalformed_WrongType (PRIVATE KEY block
    rejected as not a CERTIFICATE).
  * TestMapRevocationReason_TableDriven: full RFC 5280 reason set
    plus camelCase / underscore / ALL-CAPS variants plus
    nil-reason and unknown-reason cases.
  * acme_failure_test.go: renamed TestRevokeCertificate_AlwaysError
    → TestRevokeCertificate_UnwiredCertLookupFallback; the test
    still exercises the same backward-compat branch but now
    asserts the new "CertificateLookup wiring" error wording.

- Mock-repo updates (3 sites): mockCertificateRepository in
  internal/integration/lifecycle_test.go, mockCertRepo in
  internal/service/testutil_test.go, mockCertRepoWithGetError in
  internal/service/shortlived_test.go each gain a GetVersionBySerial
  implementation that mirrors the GetByIssuerAndSerial logic but
  returns the version row.

- docs/connectors.md ACME section: new "Revocation by serial number"
  subsection covering the workflow, the local-store requirement
  (cert was issued through certctl, not imported), the reason-code
  mapping with the three accepted spelling variants, and a pointer
  to the audit reference.

Out of scope (intentional, per spec):

- Recovering the DER from outside the local cert store (CT logs,
  CSR + signature reconstruction). If the cert wasn't issued through
  certctl, revoke-by-serial via certctl isn't possible.
- Revocation via the cert's private key (RFC 8555 §7.6 case 2). The
  account-key path covers all certctl-issued certs because the same
  account key issued them.
- Pebble-backed integration test for the happy path. Pebble integration
  is the right home for that — the unit tests in this commit pin all
  failure-mode branches before the network call, and the wiring
  branch in Rebuild is exercised by the existing
  TestIssuerRegistryRebuild paths.

Verified locally:
- gofmt -l . clean
- go vet ./... clean
- staticcheck ./... clean
- go test -short -count=1 across connector, service, repository,
  integration, api/middleware, api/handler: green

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #7.
2026-05-02 13:09:30 +00:00
shankar0123 2a384c690e secret: migrate EJBCA / GlobalSign / Sectigo credentials to *secret.Ref (Phase 2)
Phase 2 of the #6 acquisition-readiness fix from the 2026-05-01 issuer
coverage audit. Phase 1 (commit 633a10a) shipped the secret.Ref opaque
credential type with PBKDF2-derived key, ChaCha20-Poly1305 envelope,
String/MarshalJSON redaction to "[redacted]", and the Use callback
that zero-fills the per-call buffer after the consumer returns.

This commit applies the type to the three connectors flagged by the
audit and adds the JSON-roundtrip glue that the production factory
path needs.

Shared (internal/secret/):

- Add UnmarshalJSON on *Ref so json.Unmarshal of a stored config
  blob (issuerfactory.NewFromConfig) parses the bytes-as-string into
  NewRefFromString without callers having to know the field type
  changed. Null and missing keys leave the receiver nil; non-string
  payloads (numbers, bools) are rejected with a typed error. Pinned
  by TestRef_UnmarshalJSON: string_value, null, missing_key,
  number_rejected, roundtrip_marshal_then_unmarshal (the round-trip
  goes through "[redacted]" intentionally — JSON-marshal-then-
  unmarshal of a Config with secrets is NOT a supported test pattern;
  callers that construct a rawConfig must use a JSON literal with
  the real values).

Per-connector migration:

- EJBCA (ejbca.go): Config.Token: string → *secret.Ref. ValidateConfig
  empty-check uses Token.IsEmpty() (nil-safe). setAuthHeaders rewritten
  to call Token.Use; the Bearer header string is built inside the
  callback and the buffer is zeroed on return. mTLS path is
  unaffected.

- GlobalSign (globalsign.go): Config.APIKey + Config.APISecret: string
  → *secret.Ref. Both ValidateConfig empty-checks use IsEmpty().
  Extracted setAuthHeaders helper consolidates the four duplicated
  triple-Set sites (ValidateConfig probe, IssueCertificate,
  RevokeCertificate, pollCertificateOnce) so any future header-shape
  change applies once. ValidateConfig now pulls from the local cfg
  (post-Unmarshal) so the helper takes a *Config rather than the
  receiver — needed because ValidateConfig writes the validated cfg
  onto c.config only AFTER the probe succeeds.

- Sectigo (sectigo.go): Config.Login + Config.Password: string →
  *secret.Ref. CustomerURI stays plain string (org identifier, not
  a credential). setAuthHeaders rewritten to call Login.Use +
  Password.Use; ValidateConfig's inline header writes use the same
  pattern (the ValidateConfig probe writes to a local cfg, not
  c.config, so it can't share setAuthHeaders without rewiring — the
  inline form is fine, kept consistent in shape).

Test migration:

- ejbca_test.go, ejbca_failure_test.go, ejbca_stubs_test.go: bulk
  Token: "X" → Token: secret.NewRefFromString("X") via sed; secret
  import added.
- globalsign_test.go, globalsign_failure_test.go: same pattern for
  APIKey + APISecret.
- sectigo_test.go, sectigo_failure_test.go: same pattern for Login +
  Password.

Two tests (TestGlobalSign_ServerTLSConfig/PinnedCA_TrustsExpectedServer
and TestSectigoConnector/ValidateConfig_Success) used to construct
rawConfig via json.Marshal(config) → ValidateConfig(rawConfig). After
the migration, json.Marshal redacts *secret.Ref to "[redacted]" by
design, so the roundtripped rawConfig wrote "[redacted]" as the
actual header value and the mock server's auth-header check 403'd.
Both tests now build rawConfig as a JSON literal (the production-
shape input — the factory path always feeds rawConfig from the DB
or env, never from json.Marshal of an in-memory Config). The new
tests have a comment explaining the trap so the next person who
adds a similar test sees the pattern.

Out of scope (intentional):

- The `internal/config/config.SectigoConfig` / `GlobalSignConfig` /
  `EJBCAConfig` env-var-loader structs are still plain strings —
  those types are the env-load shape, not the steady-state runtime
  shape. The seed path in service/issuer.go json-marshals them into
  a map[string]interface{} which the factory then UnmarshalJSON's
  into the connector Config; the new UnmarshalJSON on *Ref handles
  the conversion at the boundary.
- DigiCert.APIKey + Vault.Token are still plain strings; Phase 3
  will pick them up. The audit explicitly named EJBCA / GlobalSign /
  Sectigo as the Phase 2 scope (RESULTS.md L633).

Verified locally:
- gofmt -l . clean
- go vet ./... clean
- staticcheck across all four packages clean
- go test -short -count=1 across secret, ejbca, globalsign, sectigo,
  issuerfactory, service, api/handler: green

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #6 — Phase 2.
2026-05-02 12:53:58 +00:00
shankar0123 0509790325 asyncpoll: refactor Sectigo / Entrust / GlobalSign to bounded polling (Phase 2)
Phase 2 of the #5 acquisition-readiness fix from the 2026-05-01 issuer
coverage audit. Phase 1 (commit 711265b) shipped the shared asyncpoll
package and refactored DigiCert as the reference. This commit applies
the same pattern to the remaining three async-CA connectors and adds
the operator-facing docs.

Per-connector refactors:

- Sectigo (sectigo.go): GetOrderStatus now wraps pollEnrollmentOnce in
  asyncpoll.Poll. The collectNotReady sentinel (cert approved by SCM
  but not yet retrievable from the collect endpoint) maps to
  StillPending and rides the backoff schedule rather than the prior
  "return pending immediately" branch. Added isPermanentStatusError
  helper to distinguish transient HTTP errors (5xx / 429 / network)
  from permanent ones (4xx / parse failure) — the wrapped checkStatus
  errors get triaged at the poll closure boundary.

- Entrust (entrust.go): GetOrderStatus wraps pollEnrollmentOnce. The
  AWAITING_APPROVAL status maps to StillPending; operators using
  approval-pending workflows where humans approve enrollments should
  bump CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS to 86400 (24h) so a
  single scheduler tick can wait through the approval window. The
  default 10-minute deadline matches the other three connectors.

- GlobalSign (globalsign.go): GetOrderStatus wraps pollCertificateOnce.
  GlobalSign tracks orders by serial number rather than order ID, but
  the polling shape is identical to the other three. Status-code
  triage matches DigiCert: 4xx (not 429) is permanent, 5xx / 429 /
  network is transient.

Per-connector Config field added:
- DigiCert.PollMaxWaitSeconds (env CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS)
- Sectigo.PollMaxWaitSeconds (env CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS)
- Entrust.PollMaxWaitSeconds (env CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS)
- GlobalSign.PollMaxWaitSeconds (env CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS)

internal/config/config.go env-var loaders updated for all four. Default
is 600 seconds (10 minutes); zero falls back to the asyncpoll package
default.

Test-helper updates: every existing test that exercises the pending
branch (collectNotReady, AWAITING_APPROVAL, status="pending", etc.)
now sets PollMaxWaitSeconds=1 in its Config so the test doesn't block
on the production-default 10-minute deadline. Tests that exercise
permanent-error branches (404, 401, malformed JSON, etc.) continue
to return immediately.

Test sites updated:
- buildSectigoConnector helper + GetOrderStatus_CollectNotReady test
- buildEntrustConnector helper + GetOrderStatus_Pending test
- buildGlobalsignConnector helper + GetOrderStatus_Pending test +
  the GetHTTPClient_NoMTLSCertPaths test (network failure now rides
  the backoff schedule rather than returning immediately)

Documentation:
- docs/async-polling.md: new operator reference covering the backoff
  schedule, status-code triage, the four env vars, failure modes, and
  where the implementation lives. Audit blocker citation included.
- docs/connectors.md: per-issuer sections for DigiCert, Sectigo,
  Entrust, GlobalSign each gain the PollMaxWaitSeconds env var row
  and a cross-link to async-polling.md.

Lint cleanup: simplified the isPermanentStatusError branch to satisfy
staticcheck S1008 (single-line return for a final boolean check).

Verified locally:
- gofmt -l . clean
- go vet ./... clean
- staticcheck ./... clean
- golangci-lint run --timeout 5m ./... → 0 issues
- go test -short -count=1 across all 4 connector packages + config + asyncpoll: green

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #5 — Phase 2.
2026-05-02 02:41:36 +00:00
shankar0123 633a10aa4e secret: add Ref opaque-credential abstraction (Phase 1)
Phase 1 of the #6 acquisition-readiness fix from the 2026-05-01 issuer
coverage audit. Pre-fix, GlobalSign / EJBCA / Sectigo store API keys
/ OAuth tokens / 3-header credentials as plain Go strings on the
Connector struct. Encrypted at rest via internal/crypto/encryption.go
(AES-256-GCM v3 + PBKDF2-600k), they sit in process memory in the
clear after load and are sent in HTTP headers on every API call.
Under DEBUG-level HTTP request logging, the headers leak.

This commit ships the foundation type. Per-connector migrations
(GlobalSign / EJBCA / Sectigo Config field changes from string to
*secret.Ref, plus auth-header write-path changes) are Phase 2 — a
separate commit per connector keeps each diff reviewable.

Phase 1 (this commit):
- internal/secret/secret.go with Ref:
    NewRef(src func() ([]byte, error))   — production: decrypt-on-demand
    NewRefFromString(s string)            — tests / config-loading
    Use(fn func(buf []byte) error)        — invoke fn with a fresh
                                            buffer, zero on return
    WriteTo(w io.Writer)                  — convenience for the
                                            "set a header" case
    String()                              — returns "[redacted]"
    MarshalJSON()                         — returns "[redacted]"
    IsEmpty()                             — for ValidateConfig paths
- The bytes are zeroed (every byte set to 0) after Use returns —
  defeats casual heap-dump extraction. The `[redacted]` brackets
  (rather than `<redacted>`) avoid Go's json HTMLEscape behavior.
- 9 unit tests covering: bytes-exposed-and-zeroed contract, the
  buffer-escape anti-pattern (asserts post-Use buffer is zeroed),
  WriteTo, String/MarshalJSON redaction, JSON-encoding inside a
  parent struct, nil-Ref safety on every method, source-error
  propagation, IsEmpty, direct test of the zero helper.

Phase 2 (separate follow-up commits):
- GlobalSign Config.APIKey / APISecret migration to *secret.Ref.
- EJBCA Config.Token migration to *secret.Ref.
- Sectigo Config.CustomerURI / Login / Password migration.
- Each migration includes the auth-header write-path change
  (setAuthHeaders → Ref.WriteTo) and the env-var-loading update
  (NewRefFromString at config load time).
- Outbound HTTP transport-wrapping for per-connector credential-
  header redaction in DEBUG logs (defense against third-party
  SDK leakage; not in scope for the foundation).

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #6 — Phase 1.
2026-05-02 02:22:07 +00:00
shankar0123 711265b652 asyncpoll: shared bounded-polling Poller + DigiCert refactor (Phase 1)
Phase 1 of the #5 acquisition-readiness fix from the 2026-05-01 issuer
coverage audit. Pre-fix, four async-CA connectors (DigiCert, Sectigo,
Entrust, GlobalSign) had GetOrderStatus paths that polled the upstream
on every scheduler tick with no exponential backoff, no max-retry cap,
and no deadline. The scheduler's tick rate (typically 30s) was the
only throttle — an unready order got hit every 30s indefinitely, and
a 429 from a rate-limited upstream produced "retry on the next tick"
which re-fanned-out the same call.

This commit ships the shared infrastructure (asyncpoll package) and
refactors DigiCert as the reference. Sectigo / Entrust / GlobalSign
follow the same mechanical pattern; they land in Phase 2.

Phase 1 (this commit):
- internal/connector/issuer/asyncpoll/asyncpoll.go: shared Poller
  with exponential backoff (5s → 15s → 45s → 2m → 5m capped),
  ±20% jitter, configurable MaxWait deadline (default 10m), and
  ctx-aware cancellation.
- Result enum: StillPending / Done / Failed. PollFunc returns
  (Result, err); Poll handles the wait loop, deadline check, and
  ctx propagation.
- ErrMaxWait sentinel for callers that want to distinguish
  "deadline exhausted" from "fn errored".
- asyncpoll_test.go: 11 tests covering happy path, transient error
  keep-polling, Failed terminates immediately, MaxWait timeout,
  MaxWait+lastErr wrap, ctx cancel, multiplicative backoff, jitter
  bounds (statistical), pct=0 deterministic, defaults applied.
- DigiCert refactor: GetOrderStatus now wraps pollOrderOnce in
  asyncpoll.Poll. Status-code triage:
    2xx + parse + status="issued"           → Done with cert
    2xx + parse + status="pending"          → StillPending
    2xx + parse + status="rejected"/"denied" → Done with status="failed"
    2xx + parse fail                        → Failed (permanent)
    4xx (not 429)                           → Failed (404 = order
                                              doesn't exist)
    429 / 5xx / network                     → StillPending
- Config.PollMaxWaitSeconds (env: CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS)
  exposes the per-call deadline knob; default 600 (10m).
- Test helper buildDigicertConnector + GetOrderStatus_Pending test
  set PollMaxWaitSeconds=1 so async-pending tests don't block 10
  minutes on the production default.

Phase 2 (separate follow-up commit, not in this PR):
- Sectigo refactor (collectNotReady sentinel maps to StillPending).
- Entrust refactor (approval-pending → longer per-issuer MaxWait).
- GlobalSign refactor (serial-tracking; same Poller).
- Per-connector cadence integration tests against fake HTTP servers.
- docs/async-polling.md + docs/connectors.md updates.

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #5 — Phase 1.
2026-05-02 02:18:50 +00:00
shankar0123 74d6b462a4 metrics: gofmt issuance_metrics_test.go — fix CI
Trivial whitespace fix: gofmt collapsed three trailing-comment columns
that I'd hand-aligned in the test file. Local sandbox missed this
because the per-file gofmt run earlier in the commit cycle was scoped
to the changed-files list and didn't include the test file at the
final write moment; CI's project-wide `gofmt -l .` caught it.

Behavior unchanged.
2026-05-02 01:27:33 +00:00
shankar0123 3b92048242 metrics: add per-issuer-type issuance counters, histogram, and failure classifier
Closes the #4 acquisition-readiness blocker from the 2026-05-01 issuer
coverage audit. Before this commit, certctl's Prometheus exposition had
zero per-issuer-type signal — operators answering "is DigiCert slow?"
or "is Sectigo failing more than ACME?" had to grep logs by issuer
name. This commit adds three series labelled by issuer type:

  certctl_issuance_total{issuer_type, outcome}
  certctl_issuance_duration_seconds{issuer_type}            (histogram)
  certctl_issuance_failures_total{issuer_type, error_class}

The histogram covers 0.05–120 second buckets to span the local-issuer
fast path and async-CA slow path (DigiCert/Sectigo/Entrust polling
can take minutes). error_class is a closed enum of eight values
(timeout, auth, rate_limited, validation, upstream_5xx, upstream_4xx,
network, other) classified once in service.ClassifyError. Cardinality
budget is ~276 new series, well within Prometheus's comfortable range.

Implementation:
- service.IssuanceMetrics is the thread-safe counter + histogram
  table. Three independent views (counters / failures / durations)
  exposed via SnapshotCounters / SnapshotFailures / SnapshotDurations.
  sync.RWMutex protects the map shape; per-key sync/atomic.Uint64
  primitives keep the recording hot path lock-free under concurrent
  service-layer goroutines.
- service.IssuanceCounterEntry / IssuanceFailureEntry /
  IssuanceDurationEntry / IssuanceMetricsSnapshotter live in service
  (not handler) to avoid an import cycle: handler already imports
  service for admin_est.go etc., so service can't import handler back.
  Handler's exposer takes the snapshotter via the service-defined
  interface.
- service.ClassifyError pure function maps error → error_class.
  context.DeadlineExceeded / context.Canceled → timeout; *net.OpError
  → network; substring matches against canonical AWS / DigiCert /
  Sectigo error shapes for auth / rate_limited / validation /
  upstream_5xx / upstream_4xx / network; unknown → other. Each branch
  has at least one representative test case in
  TestClassifyError.
- IssuerConnectorAdapter.SetMetrics wires per-adapter recording
  (issuerType + metrics). Existing 28+ test call sites of
  NewIssuerConnectorAdapter keep their one-arg signature; production
  wiring goes through SetMetrics post-construction.
- IssuerRegistry.SetIssuanceMetrics + Rebuild type-asserts to
  *IssuerConnectorAdapter and calls SetMetrics with the issuer type
  string. nil-guarded — tests that hand-build adapters without
  metrics get no-op recording.
- IssuerConnectorAdapter.IssueCertificate / RenewCertificate wrap the
  underlying connector call with start := time.Now() and
  recordIssuance(start, err). Renewal is recorded into the same
  certctl_issuance_* series as initial issuance — operationally,
  renewal IS issuance from the connector's perspective (matches the
  audit prompt's guidance on series naming).
- handler/metrics.go GetPrometheusMetrics gains a new exposer block
  emitting all three series in stable label order with correct
  Prometheus format (_bucket / _sum / _count for the histogram, +Inf
  bucket appended). Sorted via sort.Slice for stable output. nil-
  guarded so deploys without the wire produce clean exposition.
- formatLE helper trims trailing zeros from histogram bucket labels
  via strconv.FormatFloat(le, 'f', -1, 64) so the `le` labels match
  Prometheus client conventions ("0.05", "30", "120", not "0.0500"
  etc.).
- cmd/server/main.go wires a single IssuanceMetrics instance into
  both the IssuerRegistry (recording) and the MetricsHandler (exposer)
  using DefaultIssuanceBucketBoundaries.

Tests:
- TestIssuanceMetrics_RecordAndSnapshot — happy-path counter +
  histogram + failure recording, BucketBoundaries returns a copy
  (not shared storage).
- TestIssuanceMetrics_HistogramCumulative — pins the cumulative-buckets
  contract. 100ms observation lands in 0.1 bucket and every larger
  bucket; 750ms only in the 1.0 bucket. Off-by-one here would
  corrupt every quantile query downstream.
- TestIssuanceMetrics_Concurrency — 100 goroutines × 1000 ops under
  the race detector. Asserts atomic counter integrity across
  contended writes.
- TestClassifyError — 17 cases covering every branch of the closed
  enum plus the nil-error special case.

Implementation chooses the existing hand-rolled fmt.Fprintf
exposition pattern (no prometheus/client_golang dependency added)
to stay consistent with the OCSP / deploy counter blocks already in
the file.

Out of scope (separate follow-ups):
- Revocation metrics (certctl_revocation_*) — symmetric to issuance
  but the audit didn't ask; explicit follow-up commit.
- Discovery / health-check duration histograms.
- prometheus/client_golang migration.

Verified locally:
- gofmt clean
- go vet ./... clean
- staticcheck ./... clean
- golangci-lint run --timeout 5m ./... → 0 issues
- go test -short -count=1 ./internal/service/ green
- go test -short -count=1 -race -run TestIssuanceMetrics ./internal/service/ green
- go test -short -count=1 ./internal/api/handler/ green
- go build ./... success

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #4 (Part 3, narrative section).
2026-05-02 00:39:25 +00:00
shankar0123 b0efdbe2f8 repo,service: introduce WithinTx and atomic audit rows for issue/renew/revoke
Closes the #3 acquisition-readiness blocker from the 2026-05-01 issuer
coverage audit (Part 1.5 finding #1: audit row not transactional with
issuance). AuditRepository.Create previously ran on the package-level
*sql.DB while the certificate insert / version insert / revocation
insert ran on independent connections — a failed audit INSERT after
a successful operation INSERT was silently lost. SOX §404 over IT
general controls, PCI-DSS §10 audit logging, HIPAA §164.312(b) audit
controls, and CA/B Forum Baseline Requirements §5.4.1 audit log
records all presume audit-with-operation atomicity.

Design — Option A (Querier abstraction). The chosen pattern: a shared
repository.Querier interface (subset of *sql.DB and *sql.Tx) plus a
postgres.WithinTx helper that begins a tx, runs fn, commits on nil
error, rolls back on error or panic, and returns the wrapped result.
Repository methods that participate in a service-layer transaction
expose a *WithTx variant taking repository.Querier; the bare methods
remain for stand-alone use. A repository.Transactor abstracts the
"begin tx, run fn, commit/rollback" lifecycle so service-layer code
runs multi-write operations atomically without holding *sql.DB
directly. Option B (UnitOfWork) was considered but adds boilerplate
without behavioral benefit for the current scope. Option C
(context-carried tx) was explicitly rejected — it hides the
transactional boundary from the type system, reproducing the class
of bug we're fixing.

This commit:
- Adds internal/repository/querier.go with the Querier interface
  (compile-time guards that *sql.DB and *sql.Tx satisfy it) and the
  Transactor interface for service-layer use.
- Adds internal/repository/postgres/tx.go with the WithinTx helper
  (begin/fn/commit/rollback with panic recovery) and a transactor
  type that satisfies repository.Transactor.
- Adds CreateWithTx variants on AuditRepository, CertificateRepository
  (Create + Update + CreateVersion), and RevocationRepository.
  Existing bare methods now delegate to the *WithTx variant using
  the package-level *sql.DB so existing call sites are
  behavior-preserving.
- Updates repository/interfaces.go: AuditRepository, CertificateRepository,
  and RevocationRepository declare the new *WithTx methods. Adds an
  atomicity contract doc-comment on AuditRepository pointing at
  WithinTx + the audit blocker.
- Adds AuditService.RecordEventWithTx, mirroring RecordEvent but
  routing through CreateWithTx so the audit row is part of the
  caller's transaction. Same redaction + marshalling contract.
- Refactors three audit-emitting service paths to use Transactor.WithinTx
  when SetTransactor was wired, with a legacy fallback for backward
  compat:
    * CertificateService.Create — cert insert + audit row in one tx.
    * RevocationSvc.RevokeCertificateWithActor — cert status update +
      revocation row + audit row in one tx. The OCSP cache invalidate
      remains best-effort (out of scope per the prompt).
    * RenewalService CompleteServerRenewal — cert version insert +
      cert update + audit row in one tx. Job status update stays
      outside the audit-atomicity scope (job state lives outside
      the operator-facing audit trail).
- Adds SetTransactor on CertificateService, RevocationSvc, and
  RenewalService. cmd/server/main.go wires a single Transactor
  instance shared across all three so all audit-emitting paths run
  their writes in transactions backed by the same *sql.DB handle.
- Updates 5 mock implementations to satisfy the new interface methods:
  mockCertRepo (testutil_test.go), mockCertRepoWithGetError
  (shortlived_test.go), fakeRevocationRepo (crl_cache_test.go),
  intuneE2EAuditRepo (scep_intune_e2e_test.go), and the integration-
  test mocks (lifecycle_test.go: mockCertificateRepository,
  mockAuditRepository, mockRevocationRepository). All *WithTx mocks
  ignore the Querier and delegate to the bare method (mocks have no
  DB; in-memory state is shared regardless of "tx").
- Adds a service-layer test mockTransactor with BeginTxErr and
  CommitErr knobs so the atomic-audit tests can assert error
  propagation through the transactional boundary.
- Adds internal/repository/postgres/tx_test.go: unit-level test that
  WithinTx surfaces "begin tx" wrap when BeginTx fails, and that
  Transactor.WithinTx delegates correctly. Real-Postgres rollback
  semantics are covered by the testcontainers tests in the postgres
  package — sandbox disk pressure prevented adding a sqlmock dep
  for the in-fn / commit-failure unit test, so those scenarios are
  exercised through atomic_audit_test.go using the mockTransactor's
  CommitErr / BeginTxErr fields.
- Adds internal/service/atomic_audit_test.go:
    * TestCertificateService_Create_AtomicWithTx — asserts audit
      insert failure inside the tx surfaces as the operation's error
      (closes the blocker contract).
    * TestCertificateService_Create_LegacyPathLogs — pins the
      backward-compat behavior when SetTransactor isn't wired:
      audit failure is logged-not-failed, matching pre-fix.
    * TestCertificateService_Create_TransactorBeginFailure — BeginTx
      error path: operation fails, no cert insert, no audit insert.
    * TestCertificateService_Create_TransactorCommitFailure —
      Commit error after successful in-fn writes surfaces as the
      operation's error. Real Postgres can fail Commit on
      serialization conflicts; the service must report this.

Out of scope (separate follow-up commits, same shape):
- Issuer CRUD audit atomicity.
- Target CRUD audit atomicity.
- Agent retire (already transactional via RetireAgentWithCascade;
  verified, not changed).
- Renewal-policy CRUD audit atomicity.
- Owner/team/agent-group CRUD audit atomicity.
- Discovery / health-check audit atomicity.

Verified locally:
- gofmt -l . clean
- go vet ./... clean
- staticcheck ./... clean
- golangci-lint run --timeout 5m ./... → 0 issues
- go test -short -count=1 ./internal/service/ green
- go test -short -count=1 ./internal/api/handler/ green
- go test -short -count=1 ./internal/integration/ green
- go test -short -count=1 ./internal/repository/postgres/ green
- go build ./... success

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #3 (Part 3, narrative section).
2026-05-02 00:29:09 +00:00
shankar0123 3669556e57 ejbca: wire mTLS client cert in New()
Closes the #2 acquisition-readiness blocker from the 2026-05-01 issuer
coverage audit. New() at ejbca.go:L79-L88 previously constructed an
http.Client with only Timeout set — no Transport, no TLSClientConfig.
When AuthMode=mtls (the default), the client never presented the
configured ClientCert/ClientKey. The OAuth2 path worked; mTLS always
failed authentication. Tests passed because they injected a pre-built
*http.Client via NewWithHTTPClient, a path the production factory never
took.

This commit:
- Rewrites New() to load ClientCertPath + ClientKeyPath via
  tls.LoadX509KeyPair when AuthMode=mtls, configure
  *http.Transport.TLSClientConfig with MinVersion: TLS 1.2 (compatibility
  floor for on-prem EJBCA installs that may predate TLS 1.3), and return
  (*Connector, error). Constructs a fresh *http.Transport — does NOT
  clone http.DefaultTransport, which would leak mutation across the
  package boundary.
- OAuth2 mode unchanged: returns a client with no transport
  customization (the Bearer header path is wired in setAuthHeaders).
- Invalid auth_mode values return (nil, error) immediately rather than
  falling through to the mtls default and erroring at cert load.
- Updates the factory call site at issuerfactory/factory.go for the
  new signature; the factory's outer (issuer.Connector, error) shape
  was already in place.
- Adds TestNew_MTLSWiresClientCert: calls production New() (NOT
  NewWithHTTPClient) with real cert/key files generated via stdlib
  crypto/x509, asserts httpClient.Transport.TLSClientConfig.Certificates
  is non-empty. Includes an httptest TLS server with
  ClientAuth: tls.RequireAndVerifyClientCert that proves the cert is
  actually presented on the wire — not just stashed in a struct field.
- Adds TestNew_MTLSCertLoadFailure: missing-cert path returns an error
  wrapping fs.ErrNotExist (verified via errors.Is).
- Adds TestNew_OAuth2NoTransportTuning: OAuth2 path leaves Transport
  nil, ensuring no accidental mTLS bleedthrough.
- Adds TestNew_InvalidAuthMode: explicit guard that auth_mode values
  other than "mtls"/"oauth2" return (nil, error) at New() time.
- Adds export_test.go with HTTPClientForTest helper so the external
  ejbca_test package can inspect the connector's internal *http.Client
  for the wiring assertions. Compile-only during `go test`; production
  builds don't expose it.
- Adds mustNewForValidateConfig test helper (OAuth2 placeholder
  connector) for the existing ValidateConfig-only tests; pre-fix they
  used New(nil, ...) which is no longer valid because nil config falls
  into the mTLS default branch that requires non-nil cert paths.
- Updates ejbca_stubs_test.go (internal package) for the new
  (*Connector, error) signature; switches the dummy connector to
  OAuth2 mode so Config{} doesn't error at New().

Out of scope (separate follow-ups, per the prompt's explicit fence):
- OAuth2 token refresh missing
- Config.Token plaintext at runtime (needs SecretRef abstraction)
- RevokeCertificate composite OrderID parsing (the issuerDN := "" line
  at ejbca.go:L313)

Verified locally:
- gofmt clean
- go vet ./... clean
- staticcheck ./... clean
- golangci-lint run --timeout 5m ./... → 0 issues
- go test -short -count=1 ./internal/connector/issuer/ejbca/ green
- go test -short -count=1 ./internal/connector/issuerfactory/ green
- go test -short -count=1 ./internal/service/ green
- go build ./... success

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #2.
2026-05-02 00:08:24 +00:00
shankar0123 804a1b05ce awsacmpca: thread ctx through factory + registry — fix CI contextcheck
Follow-up to 590f654 (awsacmpca: replace stub client with AWS SDK v2
implementation). CI's golangci-lint contextcheck rule flagged six
violations in awsacmpca_test.go where mustNew/awsacmpca.New were
called from test functions that had ctx in scope but didn't thread it
through New(). The previous commit used context.Background() inside
New() with the rationale that "the audit allows either threading or
documenting the limitation"; CI made that choice for us.

Threading ctx is the right shape per the audit's stated preference.
The fix cascades from awsacmpca.New through issuerfactory.NewFromConfig
and IssuerRegistry.Rebuild because the contextcheck rule propagates
upward through every caller that has ctx in scope.

This commit:
- Changes awsacmpca.New(config, logger) to
  awsacmpca.New(ctx, config, logger). The ctx is passed to
  buildSDKClient → awsconfig.LoadDefaultConfig so SDK credential chain
  resolution honors caller deadlines (LoadDefaultConfig may probe IMDS
  or remote credential sources). The doc-comment on New explains that
  callers without a useful deadline should pass context.Background()
  and that the SDK has internal credential-resolution timeouts.
- Adds ctx as the first parameter of issuerfactory.NewFromConfig.
  Currently only the AWSACMPCA branch uses ctx (it's threaded into
  awsacmpca.New); the other 11 branches accept ctx without using it.
  This is a contractual change that lets callers thread ctx through
  without contextcheck warnings, even though most issuer constructors
  do no ctx-aware work today.
- Adds ctx as the first parameter of IssuerRegistry.Rebuild. Rebuild
  iterates over configs and calls NewFromConfig per issuer; the same
  ctx flows through every connector instantiation.
- Updates the two production call sites in internal/service:
  - issuer.go:279 (TestIssuer connection test) now passes its
    method-scoped ctx
  - issuer.go:303 (BuildRegistry) now passes its method-scoped ctx
    to Rebuild
- Updates 13 test sites in internal/connector/issuerfactory/factory_test.go
  via a new testCtx() helper that returns context.Background(). Helper
  is dedicated to this file so contextcheck's "you have a ctx in scope,
  pass it" rule doesn't fire on test functions that don't otherwise
  need ctx.
- Updates 6 test sites in internal/service/issuer_registry_test.go
  to pass context.Background() to Rebuild.
- Removes the now-stale "// NewFromConfig has no ctx parameter
  (preserved across all 12 connectors); pass context.Background() ..."
  comment from the awsacmpca branch in factory.go — that workaround
  is no longer the design.

Verified locally:
- gofmt -l . clean
- go vet ./... clean
- staticcheck ./... clean
- golangci-lint run --timeout 5m ./... clean (was failing with 6
  contextcheck issues before the cascade; now 0 issues)
- go test -short -count=1 across all changed packages green

Sandbox couldn't run the existing CI's full make verify due to
disk pressure on /sessions and a virtiofs concurrent-open-file
ceiling on go mod tidy; operator should run `make verify` on
the workstation to confirm.

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #1 (CI follow-up; behavior unchanged from 590f654).
2026-05-01 23:27:25 +00:00
shankar0123 590f654b0d awsacmpca: replace stub client with AWS SDK v2 implementation
Closes the #1 acquisition-readiness blocker from the 2026-05-01 issuer
coverage audit. The production New() constructor previously hardcoded
&stubClient{}, which returned "AWS SDK client not initialized (stub)" on
every method. Tests passed green via NewWithClient mock injection — a
path the production constructor never took. AWSACMPCA was wired into
the factory, the seed file, the test suite, and marketing collateral
but did not actually issue, retrieve, or revoke certificates.

This commit:
- Adds aws-sdk-go-v2/{config,service/acmpca,aws} to go.mod (with
  acmpca/types as a sub-package). go mod tidy could not be completed
  in the sandbox due to virtiofs concurrent-open-file ceiling on the
  module cache; the require blocks were arranged manually so the three
  directly-imported packages are non-indirect. Build, vet, staticcheck,
  and the full test suite are green; operator should run `go mod tidy`
  on the workstation to confirm cosmetic ordering before pushing.
- Implements sdkClient wrapping *acmpca.Client with local input/output
  type translation. Each method translates the connector's local input
  type to the SDK's typed input, calls the SDK, and translates the SDK
  output back to the local output type. aws-sdk-go-v2 types do not
  leak out of the awsacmpca package.
- Deletes stubClient (the four "AWS SDK client not initialized (stub)"
  methods). After this commit, there is no fall-back stub; production
  New() always wires the SDK.
- Rewrites New() to load credentials via awsconfig.LoadDefaultConfig
  with awsconfig.WithRegion(config.Region) and construct the SDK client
  via acmpca.NewFromConfig. Returns (*Connector, error). When config
  is nil or config.Region is empty, New defers SDK loading; ValidateConfig
  builds the client lazily on the first successful validation. This
  preserves the test pattern of New(nil, logger) → ValidateConfig.
- Wires acmpca.NewCertificateIssuedWaiter (5-minute default timeout)
  inside sdkClient.IssueCertificate so the connector's two-call
  pattern (IssueCertificate → GetCertificate) sees synchronous-via-
  waiter semantics. The waiter is hidden from the ACMPCAClient
  interface so mock implementations stay simple.
- Maps RFC 5280 revocation reasons to acmpcatypes.RevocationReason
  via the existing mapRevocationReason helper plus a cast at the
  sdkClient.RevokeCertificate boundary.
- Updates the issuerfactory.NewFromConfig call site at factory.go:L88
  for the new (*Connector, error) signature; the factory's outer
  signature already returns (issuer.Connector, error) so the change
  is local.
- Adds nil-client guards on the four client-using connector methods
  (IssueCertificate, RevokeCertificate, GetCACertPEM, plus the
  RenewCertificate path via IssueCertificate). When the connector is
  used before ValidateConfig has been called, these methods fail-fast
  with a "client not initialized" sentinel error instead of panicking.
- Fixes the copy-paste env-var doc-comments at awsacmpca.go:L41,L45
  (CERTCTL_GOOGLE_CAS_PROJECT / CERTCTL_GOOGLE_CAS_CA_ARN →
  CERTCTL_AWS_PCA_REGION / CERTCTL_AWS_PCA_CA_ARN). The actual config
  loader at internal/config/config.go:L1556-L1561 already used the
  correct env-var names; only the doc-comments were wrong.
- Updates the package doc-comment at awsacmpca.go:L1-L36 to clarify
  the synchronous-via-waiter behavior (issuance is asynchronous at
  the API level; the waiter inside sdkClient.IssueCertificate hides
  the asynchrony).
- Adds TestNew_ProductionPath/ValidConfigBuildsRealClient: calls
  production New() (NOT NewWithClient) with a valid config, asserts
  err is nil, then calls IssueCertificate with a bogus CSR and asserts
  the resulting error is the expected PEM-decode error rather than
  the deleted stubClient's "client not initialized" sentinel. This is
  the regression-marker test the audit's D11 blocker called out as
  missing — if anyone re-introduces a stub-style placeholder from
  production New() in the future, this test fails.
- Adds TestNew_ProductionPath/NilConfigDefersClientInit: documents the
  lazy-init contract for the New(nil, logger) → ValidateConfig pattern.
- Adds TestNew_ProductionPath/ValidateConfigBuildsClientLazily: verifies
  that ValidateConfig wires the SDK client when New was called with
  nil config.
- Adds TestNew_ProductionPath/{Revoke,GetCAPEM}BeforeInitFailsFast:
  verifies the nil-client guards on the other client-using methods.
- Adds TestNew_ErrorPaths covering AccessDeniedException-shaped errors,
  transient 5xx errors, and ctx-cancel propagation via the existing
  mockACMPCAClient.
- Updates docs/connectors.md:L490-L555 with: the synchronous-via-waiter
  behavior, a complete IAM policy example scoped to the four ACM PCA
  actions, a worked POST /api/v1/issuers example, and a troubleshooting
  section with three known failure modes (AccessDeniedException,
  ResourceNotFoundException, waiter timeout).

Live AWS integration testing is intentionally not added: ACM PCA is a
Pro-tier feature in localstack and the existing interface-mock tests
cover correctness end-to-end. Operators with AWS credentials can
validate by following the worked example in docs/connectors.md.

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #1 (Part 3, narrative section).
2026-05-01 23:13:59 +00:00
shankar0123 7cb453a336 chore(fmt): repo-wide gofmt -w sweep — close drift surfaced by ci-pipeline-cleanup Phase 4
Mechanical reformat. The new 'gofmt drift' CI step (added in
ci-pipeline-cleanup Phase 4, commit 0f205a8) surfaced 111 files
with accumulated gofmt drift across cmd/, internal/, and deploy/test/.

Each file's diff is gofmt-standard: whitespace adjustments, intra-
group import sorting (alphabetical by import path within blank-line-
separated groups), and struct-tag column alignment. No semantic
changes — verified via 'git diff --ignore-all-space' which shows only
the line-position deltas from import reordering.

The gate stays in place after this commit. Going forward it catches
gofmt drift at PR time.
2026-04-30 22:33:57 +00:00
shankar0123 77abb7096c fix(config): wire CERTCTL_DEPLOY_BACKUP_RETENTION + CERTCTL_K8S_DEPLOY_KUBELET_SYNC_TIMEOUT to satisfy G-3 docs-drift guard
CI failed on the G-3 docs-drift guard for the deploy-hardening I
release commit (88e8a417 / b95a548 docs commit): the docs at
docs/features.md mention CERTCTL_DEPLOY_BACKUP_RETENTION and
CERTCTL_K8S_DEPLOY_KUBELET_SYNC_TIMEOUT but config.go didn't
declare or load them. Classic "lying field" — operator-visible
documented env var that quietly does nothing because the wire
never reaches the consumer.

Per CLAUDE.md operating rule "Always take the complete path, not
the easy path": fix the wire instead of removing the docs.

Adds two fields to CertManagementConfig:
- DeployBackupRetention int (default 3, frozen decision 0.2)
- K8sDeployKubeletSyncTimeout time.Duration (default 60s, Phase 9)

Loaded in NewConfig via getEnvInt + getEnvDuration. Each field
documented with its source phase + frozen-decision reference for
auditors.

These config values are loaded but not yet consumed by the agent
(per Phase 10's deferral note: "agent-side wire-up is intentionally
deferred to a follow-up commit"). The follow-up wires the agent's
deployment dispatch site to inject cfg.CertManagement.DeployBackupRetention
into the per-target deploy.Plan and to pass K8sDeployKubeletSyncTimeout
to the k8ssecret connector. For now: the env vars are loaded, the
config struct holds them, the docs accurately describe the operator
contract, and the G-3 guard passes.

Local G-3 reproduction:
  DOCS_ONLY: (empty)
  CONFIG_ONLY: (empty)

Build + vet + golangci-lint v2.11.4 + go test ./internal/config/...
all clean.
2026-04-30 15:56:41 +00:00