scheduler: bound renewal concurrency via CERTCTL_RENEWAL_CONCURRENCY

Closes the #9 acquisition-readiness blocker from the 2026-05-01 issuer
coverage audit. Pre-fix, JobService.ProcessPendingJobs ran every
claimed job sequentially in a single goroutine: safe but slow, and
operators with large fleets had no lever to dial throughput up.
Switching to fire-and-forget per-job goroutines would have unbounded
the upstream-CA call rate and tripped DigiCert / Entrust / Sectigo
rate limits — certctl's response to 429 was to retry on the next
tick, re-fanning out the same calls and digging deeper into the
limit. Operators need a knob.

This commit:

- Adds CERTCTL_RENEWAL_CONCURRENCY env var (default 25) loaded via
  the existing getEnvInt pattern in internal/config/config.go.
  Documented inline as the cap for the per-tick renewal/issuance/
  deployment goroutine fan-out, with operator-tuning guidance:
  permissive upstream limits + large fleets (>10k certs) → 100;
  strict limits or async-CA-heavy fleets → 25 or lower.

- Wires golang.org/x/sync/semaphore.Weighted around the per-job
  goroutine launch in JobService.ProcessPendingJobs. Acquire(ctx, 1)
  is the load-bearing piece — it BLOCKS the loop when at the cap,
  providing real backpressure rather than fire-and-forget. The
  fan-out is split into processPendingJobsSequential (legacy,
  preserved for unit-test wiring that doesn't call
  SetRenewalConcurrency) and processPendingJobsConcurrent (production,
  delegates to a generic boundedFanOut helper).

- boundedFanOut takes the per-job work as a closure so the cap can
  be tested directly without standing up the renewal/deployment
  service graph. processed/failed counters use atomic.Int64 to
  avoid mutex overhead on every job completion; final log line
  reads both AFTER wg.Wait so the counts reflect every dispatched
  job. ctx-aware Acquire ensures a shutdown ctx cancel interrupts
  the dispatch loop promptly; in-flight goroutines drain via Wait
  before the function returns so no goroutine outlives the
  scheduler tick.

- shouldSkipJob extracted as a package-private helper so the
  agent-routed-deployment skip logic is shared between the
  sequential and concurrent paths byte-for-byte (the audit prompt's
  "channel-based semaphore without ctx-aware acquire" anti-pattern
  is explicitly avoided — semaphore.Weighted.Acquire returns on ctx
  done; channel <- struct{}{} would block forever).

- SetRenewalConcurrency setter on JobService normalises ≤0 to 1.
  semaphore.NewWeighted(0) constructs a semaphore that blocks every
  Acquire forever; the normalisation prevents a misconfigured env
  var from wedging the scheduler.

- cmd/server/main.go wires SetRenewalConcurrency(cfg.Scheduler.
  RenewalConcurrency) on the freshly-built jobService, immediately
  after SetAuditService. Production deployments always take the
  bounded path; tests that build JobService directly via
  NewJobService keep their strict-sequential behaviour because
  renewalConcurrency is the zero value.

- Tests in internal/service/job_concurrency_test.go:
  * TestBoundedFanOut_CapHolds — primary regression guard. 50 jobs
    × 50ms work × cap=5 → asserts peak in-flight never exceeds 5
    AND reaches 5 at least once (catches both upper-bound regressions
    and gates that incorrectly cap below the configured value).
    Lock-free max via CompareAndSwap so the measurement instrument
    doesn't itself constrain concurrency.
  * TestBoundedFanOut_AllJobsRun — lower-bound: every non-skipped
    job is dispatched.
  * TestBoundedFanOut_SkipsAgentRoutedDeployments — pins the
    shouldSkipJob contract.
  * TestBoundedFanOut_CtxCancelInterrupts — ctx cancellation
    interrupts a stuck fan-out within the timeout budget.
  * TestBoundedFanOut_FailedJobsCounted — per-job errors don't
    abort the fan-out.
  * TestSetRenewalConcurrency_NormalizesNonPositive — ≤0 → 1 fail-safe
    pinned across negative/zero/positive inputs.

- docs/features.md: scheduler-loop table augmented with the
  concurrency-cap env-var pointer alongside the job-processor row.

- docs/architecture.md: Concurrency Safety section gains a paragraph
  explaining the cap, the operator-tuning guidance, the ctx-aware
  Acquire semantics, and the audit reference.

Operator-facing impact: the first big renewal sweep no longer
takes down the upstream CA's rate-limit budget. Existing deployments
get the bounded path automatically (default 25); operators can
override via env var without code changes.

Verified locally:
- gofmt -l . clean
- go vet ./... clean
- staticcheck ./... clean
- go test -short -count=1 across service / scheduler / config /
  integration: green
- Six new tests under TestBoundedFanOut* + TestSetRenewalConcurrency*:
  green

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #9.
This commit is contained in:
shankar0123
2026-05-02 14:12:30 +00:00
parent 3a665ae6ba
commit 35e18bfc56
8 changed files with 427 additions and 9 deletions
+1 -1
View File
@@ -1205,7 +1205,7 @@ Single SQL `UNION` query replaces the previous "fetch all, filter in Go" approac
| Loop | Default Interval | Always-on | Env Var | Description |
|---|---|---|---|---|
| Renewal check | 1 hour | Yes | `CERTCTL_SCHEDULER_RENEWAL_CHECK_INTERVAL` | Check expiring certs, query ARI, create renewal jobs |
| Job processor | 30 seconds | Yes | `CERTCTL_SCHEDULER_JOB_PROCESSOR_INTERVAL` | Process pending jobs |
| Job processor | 30 seconds | Yes | `CERTCTL_SCHEDULER_JOB_PROCESSOR_INTERVAL` | Process pending jobs (concurrency cap via `CERTCTL_RENEWAL_CONCURRENCY`, default 25) |
| Job retry | 5 minutes | Yes | `CERTCTL_SCHEDULER_RETRY_INTERVAL` | Retry Failed jobs (I-001) |
| Job timeout reaper | 10 minutes | Yes | `CERTCTL_JOB_TIMEOUT_INTERVAL` (per-state thresholds: `CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT`, `CERTCTL_JOB_AWAITING_CSR_TIMEOUT`) | Fail AwaitingCSR/AwaitingApproval jobs past timeout (I-003) |
| Agent health check | 2 minutes | Yes | `CERTCTL_SCHEDULER_AGENT_HEALTH_CHECK_INTERVAL` | Check agent heartbeat staleness |