fix(scheduler): SCALE-001 — cap ClaimPendingJobs per-tick (default 1000)

Sprint 2 unified-master-audit closure. Before this fix, the scheduler
invoked ClaimPendingJobs(ctx, "", 0). A limit of 0 loads every Pending
row in a single transaction, so a 100K-job burst (cert-fleet sweep,
post-outage recovery, large agent-fleet first boot) marshalled the
full queue into process memory before boundedFanOut's semaphore could
apply back-pressure in front of the upstream CAs.

Fix:
  - SchedulerConfig.JobClaimLimit (env CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT,
    default 1000). Values ≤0 are normalised to 1000 in SetClaimLimit,
    a fail-safe against the legacy unlimited semantics.
  - JobService.claimLimit threaded into the existing
    ProcessPendingJobs flow; ClaimPendingJobs(ctx, "", s.claimLimit).
  - cmd/server/main.go wires jobService.SetClaimLimit(cfg.Scheduler.JobClaimLimit).
  - 'processing pending jobs' log line now includes claim_limit so
    operators can spot the cap engaging (count == claim_limit ⇒
    queue is running ahead of fan-out; bump CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT
    or CERTCTL_RENEWAL_CONCURRENCY).
  - Test wiring keeps the legacy zero-value (unlimited) for byte-
    for-byte compatibility with the existing 600+ JobService unit
    tests — only production code goes through SetClaimLimit.
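
A minimal sketch of the normalisation and the "cap engaged" signal
described above. The jobService type, the defaultClaimLimit constant,
and the capEngaged helper are hypothetical stand-ins for illustration;
only the SetClaimLimit name, the ≤0 to 1000 rule, and the
count == claim_limit signal come from this commit message.

```go
package main

import "fmt"

// defaultClaimLimit mirrors the default of 1000 named in the commit message.
const defaultClaimLimit = 1000

// jobService is a hypothetical stand-in for JobService; only the
// claimLimit field is modelled here.
type jobService struct {
	claimLimit int
}

// SetClaimLimit stores the per-tick cap. Non-positive values fall back
// to the default instead of meaning "unlimited".
func (s *jobService) SetClaimLimit(n int) {
	if n <= 0 {
		n = defaultClaimLimit
	}
	s.claimLimit = n
}

// capEngaged (hypothetical helper) reports whether a tick claimed a
// full batch, i.e. the queue is likely running ahead of fan-out.
func (s *jobService) capEngaged(claimed int) bool {
	return s.claimLimit > 0 && claimed == s.claimLimit
}

func main() {
	s := &jobService{}
	s.SetClaimLimit(0)
	fmt.Println(s.claimLimit, s.capEngaged(1000), s.capEngaged(42))
}
```

A service that never calls the setter keeps the zero value, which is
how the test wiring in the last bullet retains the legacy behaviour.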

Regression coverage:
  - mockJobRepo.LastClaimLimit records the limit passed through
    ClaimPendingJobs so tests can pin the propagation.
  - TestProcessPendingJobs_RespectsClaimLimit: 10 Pending rows,
    SetClaimLimit(3), expect exactly 3 transition to Running plus
    LastClaimLimit=3 on the mock.
  - TestSetClaimLimit_NormalisesNonPositive: 0/-1/-1000 all
    normalise to 1000.
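
The shape of the first regression test can be sketched with an
in-memory mock. This is an illustration under assumptions, not the
repository's test code: mockRepo, job, newRepo, and countRunning are
hypothetical, and ClaimPendingJobs is simplified to take only the
limit (the real call also takes a context and a string argument).

```go
package main

import "fmt"

// job is a hypothetical minimal job record.
type job struct {
	ID     int
	Status string
}

// mockRepo is a hypothetical in-memory stand-in for mockJobRepo.
type mockRepo struct {
	jobs           []*job
	LastClaimLimit int // records the limit passed to the last claim
}

// ClaimPendingJobs moves up to limit Pending jobs to Running.
// A limit <= 0 is treated as unlimited (the legacy semantics).
func (r *mockRepo) ClaimPendingJobs(limit int) []*job {
	r.LastClaimLimit = limit
	var claimed []*job
	for _, j := range r.jobs {
		if limit > 0 && len(claimed) >= limit {
			break
		}
		if j.Status == "Pending" {
			j.Status = "Running"
			claimed = append(claimed, j)
		}
	}
	return claimed
}

// newRepo builds a repo holding n Pending jobs.
func newRepo(n int) *mockRepo {
	r := &mockRepo{}
	for i := 0; i < n; i++ {
		r.jobs = append(r.jobs, &job{ID: i, Status: "Pending"})
	}
	return r
}

// countRunning counts jobs that have transitioned to Running.
func countRunning(r *mockRepo) int {
	n := 0
	for _, j := range r.jobs {
		if j.Status == "Running" {
			n++
		}
	}
	return n
}

func main() {
	r := newRepo(10)
	claimed := r.ClaimPendingJobs(3)
	fmt.Println(len(claimed), countRunning(r), r.LastClaimLimit)
}
```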

Closes SCALE-001.
shankar0123
2026-05-16 04:00:49 +00:00
parent 7d2e7043b9
commit 037876fa0f
6 changed files with 166 additions and 4 deletions
@@ -808,6 +808,11 @@ func main() {
 	// CERTCTL_RENEWAL_CONCURRENCY; ≤0 normalised to 1 (sequential)
 	// inside the setter.
 	jobService.SetRenewalConcurrency(cfg.Scheduler.RenewalConcurrency)
+	// SCALE-001 closure (Sprint 2, 2026-05-16): per-tick ClaimPendingJobs
+	// cap so 100K-job bursts don't materialise the full queue into
+	// memory before the bounded fan-out engages. Setting normalises ≤0
+	// to 1000 (fail-safe vs. legacy unlimited semantics).
+	jobService.SetClaimLimit(cfg.Scheduler.JobClaimLimit)
 	agentService := service.NewAgentService(agentRepo, certificateRepo, jobRepo, targetRepo, auditService, issuerRegistry, renewalService)
 	agentService.SetProfileRepo(profileRepo)
 	issuerService := service.NewIssuerService(issuerRepo, auditService, issuerRegistry, encryptionKey, logger)