fix(scheduler): SCALE-001 — cap ClaimPendingJobs per-tick (default 1000)

Sprint 2 unified-master-audit closure.

Pre-fix the scheduler invoked ClaimPendingJobs(ctx, "", 0). limit:0 loads
every Pending row in a single transaction; a 100K-job burst (cert-fleet
sweep, post-outage recovery, large agent-fleet first boot) marshalled the
full queue into process memory before boundedFanOut's semaphore could
back-pressure the upstream CAs.

Fix:
- SchedulerConfig.JobClaimLimit (env CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT,
  default 1000). ≤0 normalised to 1000 in SetClaimLimit (fail-safe vs.
  legacy unlimited semantics).
- JobService.claimLimit threaded into the existing ProcessPendingJobs
  flow; ClaimPendingJobs(ctx, "", s.claimLimit).
- cmd/server/main.go wires
  jobService.SetClaimLimit(cfg.Scheduler.JobClaimLimit).
- 'processing pending jobs' log line now includes claim_limit so
  operators can spot the cap engaging (count == claim_limit ⇒ queue is
  running ahead of fan-out; bump CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT or
  CERTCTL_RENEWAL_CONCURRENCY).
- Test wiring keeps the legacy zero-value (unlimited) for byte-for-byte
  compatibility with the existing 600+ JobService unit tests; only
  production code goes through SetClaimLimit.

Regression coverage:
- mockJobRepo.LastClaimLimit records the limit passed through
  ClaimPendingJobs so tests can pin the propagation.
- TestProcessPendingJobs_RespectsClaimLimit: 10 Pending rows,
  SetClaimLimit(3), expect exactly 3 transition to Running plus
  LastClaimLimit=3 on the mock.
- TestSetClaimLimit_NormalisesNonPositive: 0/-1/-1000 all normalise to
  1000.

Closes SCALE-001.
2026-06-07 15:01:32 +00:00 · 2026-05-16 04:00:49 +00:00
parent 7d2e7043b9
commit 037876fa0f
6 changed files with 166 additions and 4 deletions
@@ -170,6 +170,26 @@ type SchedulerConfig struct {
 	// Setting: CERTCTL_RENEWAL_CONCURRENCY environment variable.
 	RenewalConcurrency int

+	// JobClaimLimit caps the number of Pending rows a single
+	// scheduler tick may claim via repository.JobRepository.ClaimPendingJobs.
+	// Default 1000.
+	//
+	// SCALE-001 closure (Sprint 2, 2026-05-16). Pre-fix the scheduler
+	// invoked ClaimPendingJobs with limit:0, which loads every Pending
+	// row in a single transaction. A 100K-job burst (cert-fleet sweep,
+	// post-outage recovery, etc.) would marshal the full queue into
+	// process memory before boundedFanOut's semaphore could back-
+	// pressure the upstream CAs. Capping the claim per tick keeps
+	// memory bounded; the next tick (JobProcessorInterval=30s default)
+	// picks up the rest.
+	//
+	// Operator-tune: bump for very-large-fleet deploys where 1000
+	// per 30s isn't enough throughput. Values ≤ 0 fall back to 1000
+	// rather than the legacy unlimited semantics — fail-safe.
+	//
+	// Setting: CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT environment variable.
+	JobClaimLimit int
+
 	// AgentHealthCheckInterval is how often the scheduler checks agent heartbeats.
 	// Default: 2 minutes. Minimum: 1 second. Marks agents offline if no recent heartbeat.
 	// Setting: CERTCTL_SCHEDULER_AGENT_HEALTH_CHECK_INTERVAL environment variable.