mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 23:51:41 +00:00
b375df767e
Pure git mv operations; no content edits. Internal links remain pointing
at old paths and will be fixed in Phase 11. Per the Phase 1 audit
recommendations at cowork/docs-overhaul-phase-1-audit-2026-05-04/.
35 files moved across 8 audience-organized subdirectories:
docs/getting-started/ (5):
quickstart.md, concepts.md, examples.md, advanced-demo.md (was
demo-advanced.md), why-certctl.md
docs/reference/ (6):
architecture.md, api.md (was openapi.md), mcp.md,
intermediate-ca-hierarchy.md, deployment-model.md (was
deployment-atomicity.md), vendor-matrix.md (was
deployment-vendor-matrix.md)
docs/reference/protocols/ (6):
acme-server.md, acme-server-threat-model.md, scep-intune.md,
est.md, crl-ocsp.md, async-ca-polling.md (was async-polling.md)
docs/operator/ (4):
security.md, tls.md, database-tls.md, approval-workflow.md
docs/operator/runbooks/ (3):
cloud-targets.md (was runbook-cloud-targets.md), expiry-alerts.md
(was runbook-expiry-alerts.md), disaster-recovery.md
docs/migration/ (3):
from-certbot.md (was migrate-from-certbot.md), from-acmesh.md
(was migrate-from-acmesh.md), cert-manager-coexistence.md (was
certctl-for-cert-manager-users.md)
docs/compliance/ (4):
index.md (was compliance.md), soc2.md (was compliance-soc2.md),
pci-dss.md (was compliance-pci-dss.md), nist-sp-800-57.md (was
compliance-nist.md)
docs/contributor/ (4):
testing-strategy.md, test-environment.md (was test-env.md),
ci-pipeline.md, qa-test-suite.md (was qa-test-guide.md)
Deferred to later Phase 2 sub-phases:
- connectors.md split (Phase 4): docs/connectors.md +
docs/connector-{apache,f5,iis,k8s,nginx}.md still at top level
- testing-guide.md prune (Phase 5): docs/testing-guide.md still
at top level
- features.md disperse (Phase 6): docs/features.md still at top
level
- legacy-est-scep.md split (Phase 7): docs/legacy-est-scep.md
still at top level
- ACME walkthrough re-homing (Phase 8): three
docs/acme-*-walkthrough.md still at top level
- Upgrade docs archive (Phase 3): two docs/upgrade-*.md still
at top level
Cross-reference updates (Phase 11) will happen after all moves and
content edits land. Internal links to docs/* paths are temporarily
broken until that phase completes.
119 lines
5.2 KiB
Markdown
# Async-CA Polling — Operator Reference

Closes audit fix #5 from the 2026-05-01 issuer-coverage acquisition-readiness audit.

## What this is

Four issuer connectors talk to Certificate Authorities that issue
certificates **asynchronously**; `IssueCertificate` returns an order
ID immediately, and the caller (or scheduler) must call
`GetOrderStatus` later to retrieve the issued cert:

- **DigiCert** (CertCentral)
- **Sectigo** (Certificate Manager)
- **Entrust** (Certificate Services / CA Gateway)
- **GlobalSign** (Atlas HVCA)

Pre-fix, each connector's `GetOrderStatus` made one HTTP call per
invocation with no exponential backoff, no retry cap, and no deadline.
Under a renewal sweep, certctl could exhaust the upstream CA's
rate-limit budget. A 429 response was treated as a hard error,
which then caused the scheduler to retry on the next tick, re-fanning
out the same call that had just been rate-limited.

Post-fix, `GetOrderStatus` blocks for up to `PollMaxWait` (default
10 minutes) doing **bounded internal polling**:

```
attempt 1 → wait 5s → attempt 2 → wait 15s → attempt 3 → wait 45s →
attempt 4 → wait 2m → attempt 5 → wait 5m → ... (capped at 5m)
```

±20% jitter is applied to every wait so that multiple certctl instances
do not synchronize on the upstream CA's rate-limit window. The
`PollMaxWait` deadline is a hard cap; if the upstream still hasn't
completed by then, `GetOrderStatus` returns `StillPending` and the
scheduler can re-enqueue the job for a future tick.

## Status-code triage

Each connector classifies HTTP responses to drive polling decisions:

| Response | Meaning | Decision |
|---|---|---|
| 2xx + status="issued"/"completed" | Cert ready | Done: return the cert |
| 2xx + status="pending"/"processing" | Still working | StillPending: keep polling |
| 2xx + status="rejected"/"denied"/"failed" | Permanent | Done: return `OrderStatus{Status:"failed"}` |
| 2xx + parse failure | Body is broken | Failed: return error |
| 4xx other than 429 (e.g., 400/401/403/404) | Permanent client error | Failed: return error |
| 429 (rate limited) | Transient | StillPending: keep polling with backoff |
| 5xx | Transient | StillPending: keep polling with backoff |
| Network / TLS error | Transient | StillPending: keep polling with backoff |

## Operator tuning

Each connector exposes a `PollMaxWaitSeconds` config field and
matching env var:

| Connector | Env var | Default |
|---|---|---|
| DigiCert | `CERTCTL_DIGICERT_POLL_MAX_WAIT_SECONDS` | 600 (10m) |
| Sectigo | `CERTCTL_SECTIGO_POLL_MAX_WAIT_SECONDS` | 600 (10m) |
| Entrust | `CERTCTL_ENTRUST_POLL_MAX_WAIT_SECONDS` | 600 (10m) |
| GlobalSign | `CERTCTL_GLOBALSIGN_POLL_MAX_WAIT_SECONDS` | 600 (10m) |

Tune up (e.g., `86400` = 24 hours) for **Entrust approval-pending
workflows** where humans manually approve enrollments. Tune down (e.g.,
`60`) for high-throughput environments that prefer to recycle the
scheduler tick rather than block one renewal goroutine for minutes.

A value of 0 (or unset) falls back to the package default in
`internal/connector/issuer/asyncpoll`.

## Failure modes

**Upstream returns 429 forever.** The Poller respects the backoff
(5s → 15s → 45s → 2m → 5m), so a sustained 429 stream burns through
the full `PollMaxWait` budget with at most 7-8 attempts (instead of
~600 attempts at 1/sec). After `PollMaxWait` expires, `GetOrderStatus`
returns `StillPending`; the scheduler re-enqueues for the next tick.
The total request volume against the upstream is bounded by `tick
interval / minimum backoff`, typically 1-2 requests per minute even
under heavy load.

**Sectigo `collectNotReady` sentinel.** When the SCM status endpoint
reports `Issued` but the cert collect endpoint isn't yet ready, the
old code branched into a special "pending" return. Now that branch
returns `StillPending` from the poll closure, so the cert collection
rides the same backoff schedule.

**Entrust approval-pending.** The `AWAITING_APPROVAL` status maps to
`StillPending`. With the default `PollMaxWait=10m`, the scheduler
will re-enqueue once per tick if approval hasn't happened yet; with
`PollMaxWait=24h` the same renewal goroutine waits the full approval
window. Pick the latter when you have many approval-pending
enrollments per tick.

## Where the implementation lives

- `internal/connector/issuer/asyncpoll/asyncpoll.go`: shared `Poller`
  with backoff math, jitter, deadline, and ctx-aware cancellation.
- `internal/connector/issuer/digicert/digicert.go`:
  `pollOrderOnce` + `GetOrderStatus` orchestrator.
- `internal/connector/issuer/sectigo/sectigo.go`:
  `pollEnrollmentOnce` + status-code permanence triage
  (`isPermanentStatusError`).
- `internal/connector/issuer/entrust/entrust.go`:
  `pollEnrollmentOnce` + approval-pending mapping.
- `internal/connector/issuer/globalsign/globalsign.go`:
  `pollCertificateOnce` (serial-number tracking).
- `internal/connector/issuer/asyncpoll/asyncpoll_test.go`: 11 unit
  tests covering happy path, transient-then-success, Failed
  termination, MaxWait timeout, last-error wrap, ctx cancel,
  multiplicative backoff, jitter bounds, defaults.

## Audit blocker reference

cowork/issuer-coverage-audit-2026-05-01/RESULTS.md, Top-10 fix #5
(Part 1.5 finding #4: "No polling backoff for async CAs").