Async-CA Polling — Operator Reference
Last reviewed: 2026-05-05
Closes audit fix #5 from the 2026-05-01 issuer-coverage acquisition-readiness audit.
What this is
Four issuer connectors talk to Certificate Authorities that issue
certificates asynchronously — IssueCertificate returns an order
ID immediately, and the caller (or scheduler) must call
GetOrderStatus later to retrieve the issued cert:
- DigiCert (CertCentral)
- Sectigo (Certificate Manager)
- Entrust (Certificate Services / CA Gateway)
- GlobalSign (Atlas HVCA)
Pre-fix, each connector's GetOrderStatus made one HTTP call per
invocation with no exponential backoff, no retry cap, and no deadline.
Under a renewal sweep, certctl would hammer the upstream CA's
rate-limit budget. A 429 response was treated as a hard error,
which then caused the scheduler to retry on the next tick — re-fanning
out the same call that just got rate-limited.
Post-fix, GetOrderStatus blocks for up to PollMaxWait (default
10 minutes) doing bounded internal polling:
attempt 1 → wait 5s → attempt 2 → wait 15s → attempt 3 → wait 45s →
attempt 4 → wait 2m → attempt 5 → wait 5m → ... (capped at 5m)
A ±20% jitter is applied to every wait so that multiple certctl
instances are unlikely to synchronize on the upstream CA's rate-limit
window. The PollMaxWait deadline is a hard cap: if the upstream still
hasn't completed by then, GetOrderStatus returns StillPending and the
scheduler can re-enqueue the job for a future tick.
Status-code triage
Each connector classifies HTTP responses to drive polling decisions:
| Response | Meaning | Decision |
|---|---|---|
| 2xx + status="issued"/"completed" | Cert ready | Done — return the cert |
| 2xx + status="pending"/"processing" | Still working | StillPending — keep polling |
| 2xx + status="rejected"/"denied"/"failed" | Permanent | Done — return OrderStatus{Status:"failed"} |
| 2xx + parse failure | Body is broken | Failed — return error |
| 4xx (404/400/401/403) | Permanent client error | Failed — return error |
| 429 (rate limited) | Transient | StillPending — keep polling with backoff |
| 5xx | Transient | StillPending — keep polling with backoff |
| Network / TLS error | Transient | StillPending — keep polling with backoff |
Operator tuning
Each connector exposes a PollMaxWaitSeconds config field and
matching env var:
| Connector | Env var | Default |
|---|---|---|
| DigiCert | CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS | 600 (10m) |
| Sectigo | CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS | 600 (10m) |
| Entrust | CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS | 600 (10m) |
| GlobalSign | CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS | 600 (10m) |
Tune up (e.g., 86400 = 24 hours) for Entrust approval-pending
workflows where humans manually approve enrollments. Tune down (e.g.,
60) for high-throughput environments that prefer to recycle the
scheduler tick rather than block one renewal goroutine for minutes.
A value of 0 (or unset) falls back to the package default in
internal/connector/issuer/asyncpoll.
Failure modes
Upstream returns 429 forever. The Poller respects the backoff
(5s → 15s → 45s → 2m → 5m), so a sustained 429 stream burns through
the full PollMaxWait budget with at most 7-8 attempts (instead of
~600 attempts at 1/sec). After PollMaxWait expires, GetOrderStatus
returns StillPending; the scheduler re-enqueues for the next tick.
The total request volume against the upstream is bounded by tick interval / minimum backoff — typically 1-2 requests per minute even
under heavy load.
Sectigo collectNotReady sentinel. When the SCM status endpoint
reports Issued but the cert collect endpoint isn't yet ready, the
old code branched into a special "pending" return. Now that branch
returns StillPending from the poll closure, so the cert collection
rides the same backoff schedule.
Entrust approval-pending. The AWAITING_APPROVAL status maps to
StillPending. With the default PollMaxWait=10m, the scheduler
will re-enqueue once per tick if approval hasn't happened yet; with
PollMaxWait=24h the same renewal goroutine waits the full approval
window. Pick the latter when you have many approval-pending
enrollments per tick.
Where the implementation lives
- internal/connector/issuer/asyncpoll/asyncpoll.go: shared Poller with backoff math, jitter, deadline, and ctx-aware cancellation.
- internal/connector/issuer/digicert/digicert.go: pollOrderOnce + GetOrderStatus orchestrator.
- internal/connector/issuer/sectigo/sectigo.go: pollEnrollmentOnce + status-code permanence triage (isPermanentStatusError).
- internal/connector/issuer/entrust/entrust.go: pollEnrollmentOnce + approval-pending mapping.
- internal/connector/issuer/globalsign/globalsign.go: pollCertificateOnce (serial-number tracking).
- internal/connector/issuer/asyncpoll/asyncpoll_test.go: 11 unit tests covering happy path, transient-then-success, Failed termination, MaxWait timeout, last-error wrap, ctx cancel, multiplicative backoff, jitter bounds, defaults.
Audit blocker reference
cowork/issuer-coverage-audit-2026-05-01/RESULTS.md, Top-10 fix #5 (Part 1.5 finding #4: "No polling backoff for async CAs").