Phase 2 of the #5 acquisition-readiness fix from the 2026-05-01 issuer
coverage audit. Phase 1 (commit 711265b) shipped the shared asyncpoll
package and refactored DigiCert as the reference. This commit applies
the same pattern to the remaining three async-CA connectors and adds
the operator-facing docs.
Per-connector refactors:
- Sectigo (sectigo.go): GetOrderStatus now wraps pollEnrollmentOnce in
asyncpoll.Poll. The collectNotReady sentinel (cert approved by SCM
but not yet retrievable from the collect endpoint) maps to
StillPending and rides the backoff schedule rather than the prior
"return pending immediately" branch. Added isPermanentStatusError
helper to distinguish transient HTTP errors (5xx / 429 / network)
from permanent ones (4xx / parse failure) — the wrapped checkStatus
errors get triaged at the poll closure boundary.
- Entrust (entrust.go): GetOrderStatus wraps pollEnrollmentOnce. The
AWAITING_APPROVAL status maps to StillPending; operators using
approval-pending workflows where humans approve enrollments should
bump CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS to 86400 (24h) so a
single scheduler tick can wait through the approval window. The
default 10-minute deadline matches the other three connectors.
- GlobalSign (globalsign.go): GetOrderStatus wraps pollCertificateOnce.
GlobalSign tracks orders by serial number rather than order ID, but
the polling shape is identical to the other three. Status-code
triage matches DigiCert: 4xx (not 429) is permanent, 5xx / 429 /
network is transient.
Per-connector Config fields added:
- DigiCert.PollMaxWaitSeconds (env CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS)
- Sectigo.PollMaxWaitSeconds (env CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS)
- Entrust.PollMaxWaitSeconds (env CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS)
- GlobalSign.PollMaxWaitSeconds (env CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS)
internal/config/config.go env-var loaders updated for all four. Default
is 600 seconds (10 minutes); zero falls back to the asyncpoll package
default.
Test-helper updates: every existing test that exercises the pending
branch (collectNotReady, AWAITING_APPROVAL, status="pending", etc.)
now sets PollMaxWaitSeconds=1 in its Config so the test doesn't block
on the production-default 10-minute deadline. Tests that exercise
permanent-error branches (404, 401, malformed JSON, etc.) continue
to return immediately.
Test sites updated:
- buildSectigoConnector helper + GetOrderStatus_CollectNotReady test
- buildEntrustConnector helper + GetOrderStatus_Pending test
- buildGlobalsignConnector helper + GetOrderStatus_Pending test +
the GetHTTPClient_NoMTLSCertPaths test (network failure now rides
the backoff schedule rather than returning immediately)
Documentation:
- docs/async-polling.md: new operator reference covering the backoff
schedule, status-code triage, the four env vars, failure modes, and
where the implementation lives. Audit blocker citation included.
- docs/connectors.md: per-issuer sections for DigiCert, Sectigo,
Entrust, GlobalSign each gain the PollMaxWaitSeconds env var row
and a cross-link to async-polling.md.
Lint cleanup: simplified the isPermanentStatusError branch to satisfy
staticcheck S1008 (single-line return for a final boolean check).
Verified locally:
- gofmt -l . clean
- go vet ./... clean
- staticcheck ./... clean
- golangci-lint run --timeout 5m ./... → 0 issues
- go test -short -count=1 across all 4 connector packages + config + asyncpoll: green
Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #5 — Phase 2.
Async-CA Polling — Operator Reference
Closes audit fix #5 from the 2026-05-01 issuer-coverage acquisition-readiness audit.
What this is
Four issuer connectors talk to Certificate Authorities that issue
certificates asynchronously — IssueCertificate returns an order
ID immediately, and the caller (or scheduler) must call
GetOrderStatus later to retrieve the issued cert:
- DigiCert (CertCentral)
- Sectigo (Certificate Manager)
- Entrust (Certificate Services / CA Gateway)
- GlobalSign (Atlas HVCA)
Pre-fix, each connector's GetOrderStatus made one HTTP call per
invocation with no exponential backoff, no retry cap, and no deadline.
Under a renewal sweep, certctl would hammer the upstream CA's
rate-limit budget. A 429 response was treated as a hard error,
which then caused the scheduler to retry on the next tick — re-fanning
out the same call that just got rate-limited.
Post-fix, GetOrderStatus blocks for up to PollMaxWait (default
10 minutes) doing bounded internal polling:
attempt 1 → wait 5s → attempt 2 → wait 15s → attempt 3 → wait 45s →
attempt 4 → wait 2m → attempt 5 → wait 5m → ... (capped at 5m)
±20% jitter is applied at every wait so that multiple certctl instances
are unlikely to synchronize on the upstream CA's rate-limit window. The
PollMaxWait deadline is a hard cap; if the upstream still hasn't
completed by then, GetOrderStatus returns StillPending and the
scheduler can re-enqueue the job for a future tick.
Status-code triage
Each connector classifies HTTP responses to drive polling decisions:
| Response | Meaning | Decision |
|---|---|---|
| 2xx + status="issued"/"completed" | Cert ready | Done — return the cert |
| 2xx + status="pending"/"processing" | Still working | StillPending — keep polling |
| 2xx + status="rejected"/"denied"/"failed" | Permanent | Done — return OrderStatus{Status:"failed"} |
| 2xx + parse failure | Body is broken | Failed — return error |
| 4xx (404/400/401/403) | Permanent client error | Failed — return error |
| 429 (rate limited) | Transient | StillPending — keep polling with backoff |
| 5xx | Transient | StillPending — keep polling with backoff |
| Network / TLS error | Transient | StillPending — keep polling with backoff |
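The table above can be read as a single classification function. The sketch below is illustrative only: `Decision`, `classify`, and `errDial` are hypothetical names, and the handling of cases the table does not list (1xx/3xx responses, unknown status strings) is an assumption marked in the comments.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Decision is a hypothetical stand-in for the poll closure's result.
type Decision int

const (
	Done Decision = iota
	StillPending
	Failed
)

// errDial is a sample transport error used in the demo below.
var errDial = errors.New("dial tcp: connection refused")

// classify applies the triage table to one poll attempt.
func classify(netErr error, httpStatus int, bodyParsed bool, orderStatus string) Decision {
	switch {
	case netErr != nil:
		return StillPending // network / TLS error: transient
	case httpStatus == http.StatusTooManyRequests:
		return StillPending // 429: transient, keep backing off
	case httpStatus >= 500:
		return StillPending // 5xx: transient
	case httpStatus >= 400:
		return Failed // remaining 4xx: permanent client error
	case httpStatus < 200 || httpStatus >= 300:
		return Failed // assumption: 1xx/3xx are not covered by the table
	case !bodyParsed:
		return Failed // 2xx with a broken body
	}
	switch orderStatus {
	case "issued", "completed":
		return Done
	case "rejected", "denied", "failed":
		return Done // terminal; the caller reports OrderStatus{Status:"failed"}
	case "pending", "processing":
		return StillPending
	}
	return StillPending // assumption: unknown statuses keep polling
}

func main() {
	fmt.Println(classify(nil, 200, true, "issued") == Done)
	fmt.Println(classify(nil, 429, false, "") == StillPending)
	fmt.Println(classify(nil, 404, false, "") == Failed)
	fmt.Println(classify(errDial, 0, false, "") == StillPending)
}
```

Checking 429 before the general 4xx case is what keeps a rate-limit response from being treated as a permanent client error.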
Operator tuning
Each connector exposes a PollMaxWaitSeconds config field and
matching env var:
| Connector | Env var | Default |
|---|---|---|
| DigiCert | CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS | 600 (10m) |
| Sectigo | CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS | 600 (10m) |
| Entrust | CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS | 600 (10m) |
| GlobalSign | CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS | 600 (10m) |
Tune up (e.g., 86400 = 24 hours) for Entrust approval-pending
workflows where humans manually approve enrollments. Tune down (e.g.,
60) for high-throughput environments that prefer to recycle the
scheduler tick rather than block one renewal goroutine for minutes.
A value of 0 (or unset) falls back to the package default in
internal/connector/issuer/asyncpoll.
Failure modes
Upstream returns 429 forever. The Poller respects the backoff
(5s → 15s → 45s → 2m → 5m), so a sustained 429 stream burns through
the full PollMaxWait budget with at most 7-8 attempts (instead of
~600 attempts at 1/sec). After PollMaxWait expires, GetOrderStatus
returns StillPending; the scheduler re-enqueues for the next tick.
The total request volume against the upstream is bounded by tick interval / minimum backoff — typically 1-2 requests per minute even
under heavy load.
Sectigo collectNotReady sentinel. When the SCM status endpoint
reports Issued but the cert collect endpoint isn't yet ready, the
old code branched into a special "pending" return. Now that branch
returns StillPending from the poll closure, so the cert collection
rides the same backoff schedule.
Entrust approval-pending. The AWAITING_APPROVAL status maps to
StillPending. With the default PollMaxWait=10m, the scheduler
will re-enqueue once per tick if approval hasn't happened yet; with
PollMaxWait=24h the same renewal goroutine waits the full approval
window. Pick the latter when you have many approval-pending
enrollments per tick.
Where the implementation lives
- internal/connector/issuer/asyncpoll/asyncpoll.go: shared Poller with backoff math, jitter, deadline, and ctx-aware cancellation.
- internal/connector/issuer/digicert/digicert.go: pollOrderOnce + GetOrderStatus orchestrator.
- internal/connector/issuer/sectigo/sectigo.go: pollEnrollmentOnce + status-code permanence triage (isPermanentStatusError).
- internal/connector/issuer/entrust/entrust.go: pollEnrollmentOnce + approval-pending mapping.
- internal/connector/issuer/globalsign/globalsign.go: pollCertificateOnce (serial-number tracking).
- internal/connector/issuer/asyncpoll/asyncpoll_test.go: 11 unit tests covering happy path, transient-then-success, Failed termination, MaxWait timeout, last-error wrap, ctx cancel, multiplicative backoff, jitter bounds, defaults.
Audit blocker reference
cowork/issuer-coverage-audit-2026-05-01/RESULTS.md, Top-10 fix #5 (Part 1.5 finding #4: "No polling backoff for async CAs").