certctl/docs/async-polling.md
shankar0123 825fcf39a4 asyncpoll: refactor Sectigo / Entrust / GlobalSign to bounded polling (Phase 2)
Phase 2 of the #5 acquisition-readiness fix from the 2026-05-01 issuer
coverage audit. Phase 1 (commit 593210f) shipped the shared asyncpoll
package and refactored DigiCert as the reference. This commit applies
the same pattern to the remaining three async-CA connectors and adds
the operator-facing docs.

Per-connector refactors:

- Sectigo (sectigo.go): GetOrderStatus now wraps pollEnrollmentOnce in
  asyncpoll.Poll. The collectNotReady sentinel (cert approved by SCM
  but not yet retrievable from the collect endpoint) maps to
  StillPending and rides the backoff schedule rather than the prior
  "return pending immediately" branch. Added isPermanentStatusError
  helper to distinguish transient HTTP errors (5xx / 429 / network)
  from permanent ones (4xx / parse failure) — the wrapped checkStatus
  errors get triaged at the poll closure boundary.

- Entrust (entrust.go): GetOrderStatus wraps pollEnrollmentOnce. The
  AWAITING_APPROVAL status maps to StillPending; operators using
  approval-pending workflows where humans approve enrollments should
  bump CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS to 86400 (24h) so a
  single scheduler tick can wait through the approval window. The
  default 10-minute deadline matches the other three connectors.

- GlobalSign (globalsign.go): GetOrderStatus wraps pollCertificateOnce.
  GlobalSign tracks orders by serial number rather than order ID, but
  the polling shape is identical to the other three. Status-code
  triage matches DigiCert: 4xx (not 429) is permanent, 5xx / 429 /
  network is transient.

Per-connector Config field added:
- DigiCert.PollMaxWaitSeconds (env CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS)
- Sectigo.PollMaxWaitSeconds (env CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS)
- Entrust.PollMaxWaitSeconds (env CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS)
- GlobalSign.PollMaxWaitSeconds (env CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS)

internal/config/config.go env-var loaders updated for all four. Default
is 600 seconds (10 minutes); zero falls back to the asyncpoll package
default.

Test-helper updates: every existing test that exercises the pending
branch (collectNotReady, AWAITING_APPROVAL, status="pending", etc.)
now sets PollMaxWaitSeconds=1 in its Config so the test doesn't block
on the production-default 10-minute deadline. Tests that exercise
permanent-error branches (404, 401, malformed JSON, etc.) continue
to return immediately.

Test sites updated:
- buildSectigoConnector helper + GetOrderStatus_CollectNotReady test
- buildEntrustConnector helper + GetOrderStatus_Pending test
- buildGlobalsignConnector helper + GetOrderStatus_Pending test +
  the GetHTTPClient_NoMTLSCertPaths test (network failure now rides
  the backoff schedule rather than returning immediately)

Documentation:
- docs/async-polling.md: new operator reference covering the backoff
  schedule, status-code triage, the four env vars, failure modes, and
  where the implementation lives. Audit blocker citation included.
- docs/connectors.md: per-issuer sections for DigiCert, Sectigo,
  Entrust, GlobalSign each gain the PollMaxWaitSeconds env var row
  and a cross-link to async-polling.md.

Lint cleanup: simplified the isPermanentStatusError branch to satisfy
staticcheck S1008 (single-line return for a final boolean check).

Verified locally:
- gofmt -l . clean
- go vet ./... clean
- staticcheck ./... clean
- golangci-lint run --timeout 5m ./... → 0 issues
- go test -short -count=1 across all 4 connector packages + config + asyncpoll: green

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #5 — Phase 2.
2026-05-02 02:41:36 +00:00

Async-CA Polling — Operator Reference

Closes audit fix #5 from the 2026-05-01 issuer-coverage acquisition-readiness audit.

What this is

Four issuer connectors talk to Certificate Authorities that issue certificates asynchronously. For each of them, IssueCertificate returns an order ID immediately, and the caller (or scheduler) must call GetOrderStatus later to retrieve the issued cert:

  • DigiCert (CertCentral)
  • Sectigo (Certificate Manager)
  • Entrust (Certificate Services / CA Gateway)
  • GlobalSign (Atlas HVCA)

Pre-fix, each connector's GetOrderStatus made one HTTP call per invocation with no exponential backoff, no retry cap, and no deadline. Under a renewal sweep, certctl would hammer the upstream CA's rate-limit budget. A 429 response was treated as a hard error, which then caused the scheduler to retry on the next tick — re-fanning out the same call that just got rate-limited.

Post-fix, GetOrderStatus blocks for up to PollMaxWait (default 10 minutes) doing bounded internal polling:

attempt 1 → wait 5s  → attempt 2 → wait 15s → attempt 3 → wait 45s →
attempt 4 → wait 2m  → attempt 5 → wait 5m  → ... (capped at 5m)

A ±20% jitter is applied to every wait so that multiple certctl instances are unlikely to synchronize on the upstream CA's rate-limit window. The PollMaxWait deadline is a hard cap; if the upstream still hasn't completed by then, GetOrderStatus returns StillPending and the scheduler can re-enqueue the job for a future tick.

Status-code triage

Each connector classifies HTTP responses to drive polling decisions:

| Response | Meaning | Decision |
|---|---|---|
| 2xx + status="issued"/"completed" | Cert ready | Done: return the cert |
| 2xx + status="pending"/"processing" | Still working | StillPending: keep polling |
| 2xx + status="rejected"/"denied"/"failed" | Permanent | Done: return OrderStatus{Status:"failed"} |
| 2xx + parse failure | Body is broken | Failed: return error |
| 4xx except 429 (404/400/401/403) | Permanent client error | Failed: return error |
| 429 (rate limited) | Transient | StillPending: keep polling with backoff |
| 5xx | Transient | StillPending: keep polling with backoff |
| Network / TLS error | Transient | StillPending: keep polling with backoff |

Operator tuning

Each connector exposes a PollMaxWaitSeconds config field and matching env var:

| Connector | Env var | Default |
|---|---|---|
| DigiCert | CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS | 600 (10m) |
| Sectigo | CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS | 600 (10m) |
| Entrust | CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS | 600 (10m) |
| GlobalSign | CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS | 600 (10m) |

Tune up (e.g., 86400 = 24 hours) for Entrust approval-pending workflows where humans manually approve enrollments. Tune down (e.g., 60) for high-throughput environments that prefer to recycle the scheduler tick rather than block one renewal goroutine for minutes.

A value of 0 (or unset) falls back to the package default in internal/connector/issuer/asyncpoll.

Failure modes

Upstream returns 429 forever. The Poller respects the backoff (5s → 15s → 45s → 2m → 5m), so a sustained 429 stream consumes the full PollMaxWait budget in at most 7-8 attempts, compared with roughly 600 attempts when polling once per second. After PollMaxWait expires, GetOrderStatus returns StillPending and the scheduler re-enqueues the job for the next tick. The total request volume against the upstream is bounded by the scheduler tick interval and the minimum backoff, typically 1-2 requests per minute even under heavy load.

Sectigo collectNotReady sentinel. When the SCM status endpoint reports Issued but the cert collect endpoint isn't yet ready, the old code branched into a special "pending" return. Now that branch returns StillPending from the poll closure, so the cert collection rides the same backoff schedule.

Entrust approval-pending. The AWAITING_APPROVAL status maps to StillPending. With the default PollMaxWait=10m, the scheduler will re-enqueue once per tick if approval hasn't happened yet; with PollMaxWait=24h the same renewal goroutine waits the full approval window. Pick the latter when you have many approval-pending enrollments per tick.

Where the implementation lives

  • internal/connector/issuer/asyncpoll/asyncpoll.go: shared Poller with backoff math, jitter, deadline, and ctx-aware cancellation.
  • internal/connector/issuer/digicert/digicert.go: pollOrderOnce + GetOrderStatus orchestrator.
  • internal/connector/issuer/sectigo/sectigo.go: pollEnrollmentOnce + status-code permanence triage (isPermanentStatusError).
  • internal/connector/issuer/entrust/entrust.go: pollEnrollmentOnce + approval-pending mapping.
  • internal/connector/issuer/globalsign/globalsign.go: pollCertificateOnce (serial-number tracking).
  • internal/connector/issuer/asyncpoll/asyncpoll_test.go: 11 unit tests covering happy path, transient-then-success, Failed termination, MaxWait timeout, last-error wrap, ctx cancel, multiplicative backoff, jitter bounds, defaults.

Audit blocker reference

cowork/issuer-coverage-audit-2026-05-01/RESULTS.md, Top-10 fix #5 (Part 1.5 finding #4: "No polling backoff for async CAs").