Async-CA Polling — Operator Reference

Last reviewed: 2026-05-05

Closes audit fix #5 from the 2026-05-01 issuer-coverage acquisition-readiness audit.

What this is

Four issuer connectors talk to Certificate Authorities that issue certificates asynchronously. IssueCertificate returns an order ID immediately, and the caller (or scheduler) must call GetOrderStatus later to retrieve the issued cert. The four connectors are:

  • DigiCert (CertCentral)
  • Sectigo (Certificate Manager)
  • Entrust (Certificate Services / CA Gateway)
  • GlobalSign (Atlas HVCA)

Pre-fix, each connector's GetOrderStatus made one HTTP call per invocation with no exponential backoff, no retry cap, and no deadline. Under a renewal sweep, certctl would hammer the upstream CA's rate-limit budget. A 429 response was treated as a hard error, which then caused the scheduler to retry on the next tick — re-fanning out the same call that just got rate-limited.

Post-fix, GetOrderStatus blocks for up to PollMaxWait (default 10 minutes) doing bounded internal polling:

attempt 1 → wait 5s  → attempt 2 → wait 15s → attempt 3 → wait 45s →
attempt 4 → wait 2m  → attempt 5 → wait 5m  → ... (capped at 5m)

Each wait is jittered by ±20% so that multiple certctl instances do not synchronize on the upstream CA's rate-limit window. The PollMaxWait deadline is a hard cap: if the upstream still hasn't completed by then, GetOrderStatus returns StillPending and the scheduler can re-enqueue the job for a future tick.
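The schedule above can be sketched as a pure function. This is a minimal illustration, not the code in asyncpoll.go: the constant names, the multiplier of 3 (which reproduces 5s, 15s, 45s and gives 2m15s for the fourth wait, listed as 2m above), and the helper names baseWait and jitter are all assumptions.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Hypothetical constants mirroring the documented schedule; the real values
// live in internal/connector/issuer/asyncpoll and may differ.
const (
	initialWait = 5 * time.Second
	maxWait     = 5 * time.Minute
	multiplier  = 3    // assumed: 5s -> 15s -> 45s -> 2m15s -> capped at 5m
	jitterFrac  = 0.20 // +/-20%
)

// baseWait returns the un-jittered wait that follows attempt n (n starts at 1).
func baseWait(n int) time.Duration {
	d := initialWait
	for i := 1; i < n; i++ {
		d *= multiplier
		if d >= maxWait {
			return maxWait
		}
	}
	return d
}

// jitter scales d by a factor in [1-jitterFrac, 1+jitterFrac).
// r is a uniform sample in [0, 1), passed in so the function stays testable.
func jitter(d time.Duration, r float64) time.Duration {
	factor := 1 - jitterFrac + 2*jitterFrac*r
	return time.Duration(float64(d) * factor)
}

func main() {
	for n := 1; n <= 6; n++ {
		base := baseWait(n)
		fmt.Printf("after attempt %d: base=%v jittered=%v\n", n, base, jitter(base, rand.Float64()))
	}
}
```

Passing the random sample in as an argument keeps the backoff math deterministic under test, which is one way to check jitter bounds without sleeping.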

Status-code triage

Each connector classifies HTTP responses to drive polling decisions:

| Response | Meaning | Decision |
|---|---|---|
| 2xx + status="issued"/"completed" | Cert ready | Done: return the cert |
| 2xx + status="pending"/"processing" | Still working | StillPending: keep polling |
| 2xx + status="rejected"/"denied"/"failed" | Permanent | Done: return OrderStatus{Status:"failed"} |
| 2xx + parse failure | Body is broken | Failed: return error |
| 4xx (404/400/401/403) | Permanent client error | Failed: return error |
| 429 (rate limited) | Transient | StillPending: keep polling with backoff |
| 5xx | Transient | StillPending: keep polling with backoff |
| Network / TLS error | Transient | StillPending: keep polling with backoff |
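The triage table can be read as a single classification function. The sketch below is illustrative only: the Decision type, the classify signature, and the convention that an HTTP status of 0 stands for a network or TLS error are assumptions, as is the treatment of cases the table does not list (other 4xx codes, unknown status strings, 1xx/3xx), which are mapped to Failed here.

```go
package main

import (
	"fmt"
	"net/http"
)

// Decision is a hypothetical three-way outcome for one poll attempt.
type Decision int

const (
	Done Decision = iota
	StillPending
	Failed
)

func (d Decision) String() string {
	return [...]string{"Done", "StillPending", "Failed"}[d]
}

// classify maps one upstream response to a polling decision.
// httpStatus 0 stands for a network or TLS error (no response received).
// orderStatus is the parsed status field; parseOK is false when a 2xx body
// could not be decoded.
func classify(httpStatus int, orderStatus string, parseOK bool) Decision {
	switch {
	case httpStatus == 0:
		return StillPending // network / TLS error: transient
	case httpStatus == http.StatusTooManyRequests:
		return StillPending // 429: transient, keep backing off
	case httpStatus >= 500:
		return StillPending // 5xx: transient
	case httpStatus >= 400:
		return Failed // remaining 4xx: permanent client error
	case httpStatus >= 200 && httpStatus < 300:
		if !parseOK {
			return Failed
		}
		switch orderStatus {
		case "issued", "completed":
			return Done
		case "pending", "processing":
			return StillPending
		case "rejected", "denied", "failed":
			return Done // permanent; caller reports OrderStatus{Status:"failed"}
		}
		return Failed // unknown status string (assumption)
	}
	return Failed // 1xx / 3xx (assumption)
}

func main() {
	fmt.Println(classify(200, "issued", true))
	fmt.Println(classify(429, "", false))
	fmt.Println(classify(403, "", false))
}
```

The 429 case is checked before the general 4xx case so that rate limiting is never mistaken for a permanent client error.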

Operator tuning

Each connector exposes a PollMaxWaitSeconds config field and matching env var:

| Connector | Env var | Default |
|---|---|---|
| DigiCert | CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS | 600 (10m) |
| Sectigo | CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS | 600 (10m) |
| Entrust | CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS | 600 (10m) |
| GlobalSign | CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS | 600 (10m) |

Tune up (e.g., 86400 = 24 hours) for Entrust approval-pending workflows where humans manually approve enrollments. Tune down (e.g., 60) for high-throughput environments that prefer to recycle the scheduler tick rather than block one renewal goroutine for minutes.

A value of 0 (or unset) falls back to the package default in internal/connector/issuer/asyncpoll.
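As a rough sketch of the fallback rule, the helper below turns the raw env var value into a duration. The function name pollMaxWait and the constant defaultPollMaxWait are hypothetical, and treating negative or non-numeric values the same as 0 is an assumption; the source only specifies the behavior for 0 and unset.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// defaultPollMaxWait mirrors the documented 600-second default.
const defaultPollMaxWait = 10 * time.Minute

// pollMaxWait converts the raw value of a *_POLL_MAX_WAIT_SECONDS variable
// into a duration. Empty, zero, negative, or unparsable input falls back to
// the default (the last two cases are assumptions, not documented behavior).
func pollMaxWait(raw string) time.Duration {
	secs, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil || secs <= 0 {
		return defaultPollMaxWait
	}
	return time.Duration(secs) * time.Second
}

func main() {
	raw := os.Getenv("CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS")
	fmt.Println("DigiCert PollMaxWait:", pollMaxWait(raw))
}
```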

Failure modes

Upstream returns 429 forever. The Poller respects the backoff (5s → 15s → 45s → 2m → 5m), so a sustained 429 stream burns through the full PollMaxWait budget with at most 7-8 attempts (instead of ~600 attempts at 1/sec). After PollMaxWait expires, GetOrderStatus returns StillPending; the scheduler re-enqueues for the next tick. The total request volume against the upstream is bounded by tick interval / minimum backoff — typically 1-2 requests per minute even under heavy load.
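The attempt count can be checked with a small simulation. The sketch assumes zero request latency, no jitter, and a multiplier of 3 between waits; attemptsWithin and the constants are hypothetical names. Under those assumptions the backoff schedule yields 6 attempts inside a 10-minute budget (at roughly 0s, 5s, 20s, 65s, 200s, and 500s), within the bound quoted above, while a fixed 1-second interval yields 600.

```go
package main

import (
	"fmt"
	"time"
)

// Documented defaults, restated here for the simulation.
const (
	budget  = 10 * time.Minute // PollMaxWait
	first   = 5 * time.Second  // initial wait
	ceiling = 5 * time.Minute  // per-wait cap
)

// attemptsWithin counts poll attempts that start before total elapses.
// Waits begin at start, multiply by factor after each attempt, and are
// capped at limit. Request latency and jitter are ignored.
func attemptsWithin(total, start, limit time.Duration, factor int) int {
	elapsed, wait, n := time.Duration(0), start, 0
	for elapsed < total {
		n++ // an attempt starts at the current elapsed time
		elapsed += wait
		wait *= time.Duration(factor)
		if wait > limit {
			wait = limit
		}
	}
	return n
}

func main() {
	fmt.Println("with backoff:", attemptsWithin(budget, first, ceiling, 3))
	fmt.Println("fixed 1s interval:", attemptsWithin(budget, time.Second, time.Second, 1))
}
```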

Sectigo collectNotReady sentinel. When the SCM status endpoint reports Issued but the cert collect endpoint isn't yet ready, the old code branched into a special "pending" return. Now that branch returns StillPending from the poll closure, so the cert collection rides the same backoff schedule.

Entrust approval-pending. The AWAITING_APPROVAL status maps to StillPending. With the default PollMaxWait=10m, the scheduler will re-enqueue once per tick if approval hasn't happened yet; with PollMaxWait=24h the same renewal goroutine waits the full approval window. Pick the latter when you have many approval-pending enrollments per tick.
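Both the collectNotReady and approval-pending cases reduce to a poll closure that returns StillPending and lets the shared loop decide when to try again. The sketch below shows that shape under several assumptions: the Result and PollFunc types, the poll and scripted helpers, the "ISSUED" status string, and the choice to report caller cancellation as an error are all hypothetical, and waits are shortened to milliseconds so the example finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type Result int

const (
	Done Result = iota
	StillPending
	Failed
)

// PollFunc performs one attempt against the upstream CA.
type PollFunc func(ctx context.Context) (Result, error)

// poll calls fn until it stops reporting StillPending, the caller cancels
// ctx, or maxWait elapses. waits must be non-empty; its last entry repeats.
func poll(ctx context.Context, maxWait time.Duration, waits []time.Duration, fn PollFunc) (Result, error) {
	ctx, cancel := context.WithTimeout(ctx, maxWait)
	defer cancel()
	for attempt := 0; ; attempt++ {
		res, err := fn(ctx)
		if res != StillPending {
			return res, err
		}
		wait := waits[len(waits)-1]
		if attempt < len(waits) {
			wait = waits[attempt]
		}
		timer := time.NewTimer(wait)
		select {
		case <-ctx.Done():
			timer.Stop()
			if errors.Is(ctx.Err(), context.DeadlineExceeded) {
				return StillPending, nil // budget spent; scheduler re-enqueues
			}
			return Failed, ctx.Err() // caller cancelled (assumed behavior)
		case <-timer.C:
		}
	}
}

// scripted returns a PollFunc that replays upstream statuses in order,
// repeating the last one. It stands in for a real HTTP status call.
func scripted(statuses []string, calls *int) PollFunc {
	return func(ctx context.Context) (Result, error) {
		i := *calls
		if i >= len(statuses) {
			i = len(statuses) - 1
		}
		*calls++
		switch statuses[i] {
		case "ISSUED": // hypothetical terminal status name
			return Done, nil
		case "AWAITING_APPROVAL":
			return StillPending, nil
		default:
			return Failed, fmt.Errorf("unexpected status %q", statuses[i])
		}
	}
}

// run polls a scripted upstream with millisecond-scale waits.
func run(statuses []string, budget time.Duration) (Result, int, error) {
	calls := 0
	waits := []time.Duration{time.Millisecond, 2 * time.Millisecond}
	res, err := poll(context.Background(), budget, waits, scripted(statuses, &calls))
	return res, calls, err
}

func main() {
	res, calls, err := run([]string{"AWAITING_APPROVAL", "AWAITING_APPROVAL", "ISSUED"}, time.Second)
	fmt.Println(res == Done, calls, err)
	res, _, err = run([]string{"AWAITING_APPROVAL"}, 20*time.Millisecond)
	fmt.Println(res == StillPending, err)
}
```

With a short budget the second call ends in StillPending without an error, which corresponds to the scheduler re-enqueueing on the next tick; a 24-hour budget would keep the same goroutine inside the loop instead.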

Where the implementation lives

  • internal/connector/issuer/asyncpoll/asyncpoll.go: shared Poller with backoff math, jitter, deadline, and ctx-aware cancellation.
  • internal/connector/issuer/digicert/digicert.go: pollOrderOnce + GetOrderStatus orchestrator.
  • internal/connector/issuer/sectigo/sectigo.go: pollEnrollmentOnce + status-code permanence triage (isPermanentStatusError).
  • internal/connector/issuer/entrust/entrust.go: pollEnrollmentOnce + approval-pending mapping.
  • internal/connector/issuer/globalsign/globalsign.go: pollCertificateOnce (serial-number tracking).
  • internal/connector/issuer/asyncpoll/asyncpoll_test.go: 11 unit tests covering happy path, transient-then-success, Failed termination, MaxWait timeout, last-error wrap, ctx cancel, multiplicative backoff, jitter bounds, defaults.

Audit blocker reference

cowork/issuer-coverage-audit-2026-05-01/RESULTS.md, Top-10 fix #5 (Part 1.5 finding #4: "No polling backoff for async CAs").