scheduler: bound renewal concurrency via CERTCTL_RENEWAL_CONCURRENCY

Closes the #9 acquisition-readiness blocker from the 2026-05-01 issuer coverage audit. Pre-fix, JobService.ProcessPendingJobs ran every claimed job sequentially in a single goroutine: safe but slow, and operators with large fleets had no lever to dial throughput up. Switching to fire-and-forget per-job goroutines would have unbounded the upstream-CA call rate and tripped DigiCert / Entrust / Sectigo rate limits — certctl's response to 429 was to retry on the next tick, re-fanning out the same calls and digging deeper into the limit. Operators need a knob. This commit: - Adds CERTCTL_RENEWAL_CONCURRENCY env var (default 25) loaded via the existing getEnvInt pattern in internal/config/config.go. Documented inline as the cap for the per-tick renewal/issuance/ deployment goroutine fan-out, with operator-tuning guidance: permissive upstream limits + large fleets (>10k certs) → 100; strict limits or async-CA-heavy fleets → 25 or lower. - Wires golang.org/x/sync/semaphore.Weighted around the per-job goroutine launch in JobService.ProcessPendingJobs. Acquire(ctx, 1) is the load-bearing piece — it BLOCKS the loop when at the cap, providing real backpressure rather than fire-and-forget. The fan-out is split into processPendingJobsSequential (legacy, preserved for unit-test wiring that doesn't call SetRenewalConcurrency) and processPendingJobsConcurrent (production, delegates to a generic boundedFanOut helper). - boundedFanOut takes the per-job work as a closure so the cap can be tested directly without standing up the renewal/deployment service graph. processed/failed counters use atomic.Int64 to avoid mutex overhead on every job completion; final log line reads both AFTER wg.Wait so the counts reflect every dispatched job. ctx-aware Acquire ensures a shutdown ctx cancel interrupts the dispatch loop promptly; in-flight goroutines drain via Wait before the function returns so no goroutine outlives the scheduler tick. - shouldSkipJob extracted as a package-private helper so the agent-routed-deployment skip logic is shared between the sequential and concurrent paths byte-for-byte (the audit prompt's "channel-based semaphore without ctx-aware acquire" anti-pattern is explicitly avoided — semaphore.Weighted.Acquire returns on ctx done; channel <- struct{}{} would block forever). - SetRenewalConcurrency setter on JobService normalises ≤0 to 1. semaphore.NewWeighted(0) constructs a semaphore that blocks every Acquire forever; the normalisation prevents a misconfigured env var from wedging the scheduler. - cmd/server/main.go wires SetRenewalConcurrency(cfg.Scheduler. RenewalConcurrency) on the freshly-built jobService, immediately after SetAuditService. Production deployments always take the bounded path; tests that build JobService directly via NewJobService keep their strict-sequential behaviour because renewalConcurrency is the zero value. - Tests in internal/service/job_concurrency_test.go: * TestBoundedFanOut_CapHolds — primary regression guard. 50 jobs × 50ms work × cap=5 → asserts peak in-flight never exceeds 5 AND reaches 5 at least once (catches both upper-bound regressions and gates that incorrectly cap below the configured value). Lock-free max via CompareAndSwap so the measurement instrument doesn't itself constrain concurrency. * TestBoundedFanOut_AllJobsRun — lower-bound: every non-skipped job is dispatched. * TestBoundedFanOut_SkipsAgentRoutedDeployments — pins the shouldSkipJob contract. * TestBoundedFanOut_CtxCancelInterrupts — ctx cancellation interrupts a stuck fan-out within the timeout budget. * TestBoundedFanOut_FailedJobsCounted — per-job errors don't abort the fan-out. * TestSetRenewalConcurrency_NormalizesNonPositive — ≤0 → 1 fail-safe pinned across negative/zero/positive inputs. - docs/features.md: scheduler-loop table augmented with the concurrency-cap env-var pointer alongside the job-processor row. - docs/architecture.md: Concurrency Safety section gains a paragraph explaining the cap, the operator-tuning guidance, the ctx-aware Acquire semantics, and the audit reference. Operator-facing impact: the first big renewal sweep no longer takes down the upstream CA's rate-limit budget. Existing deployments get the bounded path automatically (default 25); operators can override via env var without code changes. Verified locally: - gofmt -l . clean - go vet ./... clean - staticcheck ./... clean - go test -short -count=1 across service / scheduler / config / integration: green - Six new tests under TestBoundedFanOut* + TestSetRenewalConcurrency*: green Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md Top-10 fix #9.
2026-06-07 13:51:36 +00:00 · 2026-05-02 14:12:30 +00:00
parent 3a665ae6ba
commit 35e18bfc56
8 changed files with 427 additions and 9 deletions
@@ -1143,6 +1143,25 @@ type SchedulerConfig struct {
 	// Setting: CERTCTL_SCHEDULER_JOB_PROCESSOR_INTERVAL environment variable.
 	JobProcessorInterval time.Duration

+	// RenewalConcurrency caps the number of concurrent renewal/issuance/
+	// deployment goroutines launched per job-processor tick. Default 25 —
+	// high enough to make use of HTTP/1.1 connection reuse against an
+	// upstream CA, low enough to stay under typical per-customer rate
+	// limits. Operators with permissive upstream limits and large fleets
+	// (>10k certs) can bump to 100; operators with strict limits or
+	// async-CA-heavy fleets should keep at 25 or lower.
+	//
+	// Values ≤ 0 fall back to 1 (sequential) — fail-safe rather than
+	// panicking on semaphore.NewWeighted(0) semantics.
+	//
+	// Closes the #9 acquisition-readiness blocker from the 2026-05-01
+	// issuer coverage audit. Pre-fix the per-tick fan-out had no cap,
+	// so a 5k-cert sweep launched 5k in-flight HTTP calls to upstream
+	// CAs and tripped DigiCert/Entrust/Sectigo rate limits.
+	//
+	// Setting: CERTCTL_RENEWAL_CONCURRENCY environment variable.
+	RenewalConcurrency int
+
 	// AgentHealthCheckInterval is how often the scheduler checks agent heartbeats.
 	// Default: 2 minutes. Minimum: 1 second. Marks agents offline if no recent heartbeat.
 	// Setting: CERTCTL_SCHEDULER_AGENT_HEALTH_CHECK_INTERVAL environment variable.
@@ -1434,8 +1453,11 @@ func Load() (*Config, error) {
 			DemoSeed:       getEnvBool("CERTCTL_DEMO_SEED", false),
 		},
 		Scheduler: SchedulerConfig{
-			RenewalCheckInterval:        getEnvDuration("CERTCTL_SCHEDULER_RENEWAL_CHECK_INTERVAL", 1*time.Hour),
-			JobProcessorInterval:        getEnvDuration("CERTCTL_SCHEDULER_JOB_PROCESSOR_INTERVAL", 30*time.Second),
+			RenewalCheckInterval: getEnvDuration("CERTCTL_SCHEDULER_RENEWAL_CHECK_INTERVAL", 1*time.Hour),
+			JobProcessorInterval: getEnvDuration("CERTCTL_SCHEDULER_JOB_PROCESSOR_INTERVAL", 30*time.Second),
+			// Audit fix #9 — per-tick concurrency cap on the renewal/issuance/
+			// deployment goroutine fan-out. ≤0 → 1 (sequential).
+			RenewalConcurrency:          getEnvInt("CERTCTL_RENEWAL_CONCURRENCY", 25),
 			AgentHealthCheckInterval:    getEnvDuration("CERTCTL_SCHEDULER_AGENT_HEALTH_CHECK_INTERVAL", 2*time.Minute),
 			NotificationProcessInterval: getEnvDuration("CERTCTL_SCHEDULER_NOTIFICATION_PROCESS_INTERVAL", 1*time.Minute),
 			// I-005: retry sweep for failed notifications. Mirrors RetryInterval