Pure git mv operations; no content edits. Internal links remain pointing
at old paths and will be fixed in Phase 11. Per the Phase 1 audit
recommendations at cowork/docs-overhaul-phase-1-audit-2026-05-04/.
35 files moved across 8 audience-organized subdirectories:
docs/getting-started/ (5):
quickstart.md, concepts.md, examples.md, advanced-demo.md (was
demo-advanced.md), why-certctl.md
docs/reference/ (6):
architecture.md, api.md (was openapi.md), mcp.md,
intermediate-ca-hierarchy.md, deployment-model.md (was
deployment-atomicity.md), vendor-matrix.md (was
deployment-vendor-matrix.md)
docs/reference/protocols/ (6):
acme-server.md, acme-server-threat-model.md, scep-intune.md,
est.md, crl-ocsp.md, async-ca-polling.md (was async-polling.md)
docs/operator/ (4):
security.md, tls.md, database-tls.md, approval-workflow.md
docs/operator/runbooks/ (3):
cloud-targets.md (was runbook-cloud-targets.md), expiry-alerts.md
(was runbook-expiry-alerts.md), disaster-recovery.md
docs/migration/ (3):
from-certbot.md (was migrate-from-certbot.md), from-acmesh.md
(was migrate-from-acmesh.md), cert-manager-coexistence.md (was
certctl-for-cert-manager-users.md)
docs/compliance/ (4):
index.md (was compliance.md), soc2.md (was compliance-soc2.md),
pci-dss.md (was compliance-pci-dss.md), nist-sp-800-57.md (was
compliance-nist.md)
docs/contributor/ (4):
testing-strategy.md, test-environment.md (was test-env.md),
ci-pipeline.md, qa-test-suite.md (was qa-test-guide.md)
Deferred to later Phase 2 sub-phases:
- connectors.md split (Phase 4): docs/connectors.md +
docs/connector-{apache,f5,iis,k8s,nginx}.md still at top level
- testing-guide.md prune (Phase 5): docs/testing-guide.md still
at top level
- features.md disperse (Phase 6): docs/features.md still at top
level
- legacy-est-scep.md split (Phase 7): docs/legacy-est-scep.md
still at top level
- ACME walkthrough re-homing (Phase 8): three
docs/acme-*-walkthrough.md still at top level
- Upgrade docs archive (Phase 3): two docs/upgrade-*.md still
at top level
Cross-reference updates (Phase 11) will happen after all moves and
content edits land. Internal links to docs/* paths are temporarily
broken until that phase completes.
Async-CA Polling — Operator Reference
Closes audit fix #5 from the 2026-05-01 issuer-coverage acquisition-readiness audit.
What this is
Four issuer connectors talk to Certificate Authorities that issue
certificates asynchronously — IssueCertificate returns an order
ID immediately, and the caller (or scheduler) must call
GetOrderStatus later to retrieve the issued cert:
- DigiCert (CertCentral)
- Sectigo (Certificate Manager)
- Entrust (Certificate Services / CA Gateway)
- GlobalSign (Atlas HVCA)
Pre-fix, each connector's GetOrderStatus made one HTTP call per
invocation with no exponential backoff, no retry cap, and no deadline.
Under a renewal sweep, certctl would hammer the upstream CA's
rate-limit budget. A 429 response was treated as a hard error,
which then caused the scheduler to retry on the next tick — re-fanning
out the same call that just got rate-limited.
Post-fix, GetOrderStatus blocks for up to PollMaxWait (default
10 minutes) doing bounded internal polling:
attempt 1 → wait 5s → attempt 2 → wait 15s → attempt 3 → wait 45s →
attempt 4 → wait 2m → attempt 5 → wait 5m → ... (capped at 5m)
±20% jitter is applied at every wait so multiple certctl instances
are unlikely to synchronize on the upstream CA's rate-limit window. The
PollMaxWait deadline is a hard cap; if the upstream still hasn't
completed by then, GetOrderStatus returns StillPending and the
scheduler can re-enqueue the job for a future tick.
Status-code triage
Each connector classifies HTTP responses to drive polling decisions:
| Response | Meaning | Decision |
|---|---|---|
| 2xx + status="issued"/"completed" | Cert ready | Done — return the cert |
| 2xx + status="pending"/"processing" | Still working | StillPending — keep polling |
| 2xx + status="rejected"/"denied"/"failed" | Permanent | Done — return OrderStatus{Status:"failed"} |
| 2xx + parse failure | Body is broken | Failed — return error |
| 4xx other than 429 (e.g. 400/401/403/404) | Permanent client error | Failed — return error |
| 429 (rate limited) | Transient | StillPending — keep polling with backoff |
| 5xx | Transient | StillPending — keep polling with backoff |
| Network / TLS error | Transient | StillPending — keep polling with backoff |
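The triage table can be read as a single classification function. The sketch below is illustrative only: the Outcome type and the classify function are hypothetical names, and the handling of 3xx responses and unrecognized status strings (both mapped to Failed here) is an assumption, since the table does not cover them. Network and TLS errors never produce an HTTP status, so a caller would map them to StillPending before reaching this function.

```go
package main

import (
	"fmt"
	"net/http"
)

// Outcome is a hypothetical three-way polling decision; the real
// connectors may model this differently.
type Outcome string

const (
	Done         Outcome = "Done"
	StillPending Outcome = "StillPending"
	Failed       Outcome = "Failed"
)

// classify maps an HTTP status code plus the order status parsed from a
// 2xx body onto a polling decision. parseOK is false when the 2xx body
// could not be decoded.
func classify(httpStatus int, orderStatus string, parseOK bool) Outcome {
	switch {
	case httpStatus == http.StatusTooManyRequests:
		return StillPending // rate limited: transient, back off and retry
	case httpStatus >= 500:
		return StillPending // upstream trouble: transient
	case httpStatus >= 400:
		return Failed // remaining 4xx: permanent client error
	case httpStatus >= 200 && httpStatus < 300:
		if !parseOK {
			return Failed // broken body
		}
		switch orderStatus {
		case "issued", "completed":
			return Done // cert ready
		case "pending", "processing":
			return StillPending
		case "rejected", "denied", "failed":
			return Done // terminal; caller reports a failed order status
		}
	}
	return Failed // assumption: anything unrecognized is treated as an error
}

func main() {
	fmt.Println(classify(http.StatusOK, "issued", true))
	fmt.Println(classify(http.StatusTooManyRequests, "", false))
	fmt.Println(classify(http.StatusNotFound, "", false))
}
```

The 429 case is checked before the general 4xx case, which is what keeps a rate-limit response from being treated as a permanent client error.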
Operator tuning
Each connector exposes a PollMaxWaitSeconds config field and
matching env var:
| Connector | Env var | Default |
|---|---|---|
| DigiCert | CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS | 600 (10m) |
| Sectigo | CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS | 600 (10m) |
| Entrust | CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS | 600 (10m) |
| GlobalSign | CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS | 600 (10m) |
Tune up (e.g., 86400 = 24 hours) for Entrust approval-pending
workflows where humans manually approve enrollments. Tune down (e.g.,
60) for high-throughput environments that prefer to recycle the
scheduler tick rather than block one renewal goroutine for minutes.
A value of 0 (or unset) falls back to the package default in
internal/connector/issuer/asyncpoll.
Failure modes
Upstream returns 429 forever. The Poller respects the backoff
(5s → 15s → 45s → 2m → 5m), so a sustained 429 stream burns through
the full PollMaxWait budget with at most 7-8 attempts (instead of
~600 attempts at 1/sec). After PollMaxWait expires, GetOrderStatus
returns StillPending; the scheduler re-enqueues for the next tick.
The total request volume against the upstream is bounded by tick interval / minimum backoff — typically 1-2 requests per minute even
under heavy load.
Sectigo collectNotReady sentinel. When the SCM status endpoint
reports Issued but the cert collect endpoint isn't yet ready, the
old code branched into a special "pending" return. Now that branch
returns StillPending from the poll closure, so the cert collection
rides the same backoff schedule.
Entrust approval-pending. The AWAITING_APPROVAL status maps to
StillPending. With the default PollMaxWait=10m, the scheduler
will re-enqueue once per tick if approval hasn't happened yet; with
PollMaxWait=24h the same renewal goroutine waits the full approval
window. Pick the latter when you have many approval-pending
enrollments per tick.
Where the implementation lives
- internal/connector/issuer/asyncpoll/asyncpoll.go: shared Poller with backoff math, jitter, deadline, and ctx-aware cancellation.
- internal/connector/issuer/digicert/digicert.go: pollOrderOnce + GetOrderStatus orchestrator.
- internal/connector/issuer/sectigo/sectigo.go: pollEnrollmentOnce + status-code permanence triage (isPermanentStatusError).
- internal/connector/issuer/entrust/entrust.go: pollEnrollmentOnce + approval-pending mapping.
- internal/connector/issuer/globalsign/globalsign.go: pollCertificateOnce (serial-number tracking).
- internal/connector/issuer/asyncpoll/asyncpoll_test.go: 11 unit tests covering happy path, transient-then-success, Failed termination, MaxWait timeout, last-error wrap, ctx cancel, multiplicative backoff, jitter bounds, defaults.
Audit blocker reference
cowork/issuer-coverage-audit-2026-05-01/RESULTS.md, Top-10 fix #5 (Part 1.5 finding #4: "No polling backoff for async CAs").