fix(scheduler): SCALE-001 — cap ClaimPendingJobs per-tick (default 1000)

Sprint 2 unified-master-audit closure. Before this fix the scheduler
invoked ClaimPendingJobs(ctx, "", 0). A limit of 0 loads every Pending
row in a single transaction, so a 100K-job burst (cert-fleet sweep,
post-outage recovery, large agent-fleet first boot) marshalled the
full queue into process memory before boundedFanOut's semaphore could
apply back-pressure to the upstream CAs.

Fix:
  - SchedulerConfig.JobClaimLimit (env CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT,
    default 1000). SetClaimLimit normalises values ≤0 to 1000, a
    fail-safe choice over the legacy unlimited semantics.
  - JobService.claimLimit threaded into the existing
    ProcessPendingJobs flow; ClaimPendingJobs(ctx, "", s.claimLimit).
  - cmd/server/main.go wires jobService.SetClaimLimit(cfg.Scheduler.JobClaimLimit).
  - 'processing pending jobs' log line now includes claim_limit so
    operators can spot the cap engaging (count == claim_limit ⇒
    queue is running ahead of fan-out; bump CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT
    or CERTCTL_RENEWAL_CONCURRENCY).
  - Test wiring keeps the legacy zero-value (unlimited) for byte-
    for-byte compatibility with the existing 600+ JobService unit
    tests — only production code goes through SetClaimLimit.
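
The "count == claim_limit" signal from the log-line bullet above, and the
drain time it implies, can be sketched as follows. This is a minimal
illustration, not code from this commit: capEngaged and ticksToDrain are
hypothetical helper names, and the estimate assumes no new jobs arrive
while the queue drains.

```go
package main

import (
	"fmt"
	"time"
)

// capEngaged reports whether a tick claimed exactly claimLimit rows, which
// suggests Pending rows were left behind for later ticks.
// Hypothetical helper; a non-positive limit is treated as "no cap".
func capEngaged(count, claimLimit int) bool {
	return claimLimit > 0 && count == claimLimit
}

// ticksToDrain estimates how many scheduler ticks are needed to claim
// `pending` rows at `claimLimit` rows per tick (ceiling division).
// A non-positive limit models the legacy unlimited claim: one tick.
func ticksToDrain(pending, claimLimit int) int {
	if pending <= 0 {
		return 0
	}
	if claimLimit <= 0 {
		return 1
	}
	return (pending + claimLimit - 1) / claimLimit
}

func main() {
	// Defaults taken from the commit: 1000 rows per tick, 30s tick interval.
	const interval = 30 * time.Second
	ticks := ticksToDrain(100_000, 1000)
	fmt.Println(capEngaged(1000, 1000), ticks, time.Duration(ticks)*interval)
	// prints: true 100 50m0s
}
```

Under these assumptions a 100K-job burst takes roughly 50 minutes to be
fully claimed at the defaults, which is the case where raising
CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT may be worth considering.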

Regression coverage:
  - mockJobRepo.LastClaimLimit records the limit passed through
    ClaimPendingJobs so tests can pin the propagation.
  - TestProcessPendingJobs_RespectsClaimLimit: 10 Pending rows,
    SetClaimLimit(3), expect exactly 3 transition to Running plus
    LastClaimLimit=3 on the mock.
  - TestSetClaimLimit_NormalisesNonPositive: 0/-1/-1000 all
    normalise to 1000.

Closes SCALE-001.
shankar0123
2026-05-16 04:00:49 +00:00
parent 7d2e7043b9
commit 037876fa0f
6 changed files with 166 additions and 4 deletions
+6 -1
@@ -350,7 +350,12 @@ func Load() (*Config, error) {
JobProcessorInterval: getEnvDuration("CERTCTL_SCHEDULER_JOB_PROCESSOR_INTERVAL", 30*time.Second),
// Audit fix #9 — per-tick concurrency cap on the renewal/issuance/
// deployment goroutine fan-out. ≤0 → 1 (sequential).
RenewalConcurrency: getEnvInt("CERTCTL_RENEWAL_CONCURRENCY", 25),
// SCALE-001 closure (Sprint 2, 2026-05-16) — per-tick claim cap on
// the scheduler's ClaimPendingJobs sweep. Default 1000 keeps the
// fan-out busy (≈40× the renewal-concurrency cap) without
// page-thrashing on a 100K-job burst. ≤0 → 1000 (fail-safe).
JobClaimLimit: getEnvInt("CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT", 1000),
AgentHealthCheckInterval: getEnvDuration("CERTCTL_SCHEDULER_AGENT_HEALTH_CHECK_INTERVAL", 2*time.Minute),
NotificationProcessInterval: getEnvDuration("CERTCTL_SCHEDULER_NOTIFICATION_PROCESS_INTERVAL", 1*time.Minute),
// I-005: retry sweep for failed notifications. Mirrors RetryInterval
+20
@@ -170,6 +170,26 @@ type SchedulerConfig struct {
// Setting: CERTCTL_RENEWAL_CONCURRENCY environment variable.
RenewalConcurrency int
// JobClaimLimit caps the number of Pending rows a single
// scheduler tick may claim via repository.JobRepository.ClaimPendingJobs.
// Default 1000.
//
// SCALE-001 closure (Sprint 2, 2026-05-16). Pre-fix the scheduler
// invoked ClaimPendingJobs with limit:0, which loads every Pending
// row in a single transaction. A 100K-job burst (cert-fleet sweep,
// post-outage recovery, etc.) would marshal the full queue into
// process memory before boundedFanOut's semaphore could apply
// back-pressure to the upstream CAs. Capping the claim per tick keeps
// memory bounded; subsequent ticks (JobProcessorInterval, 30s by
// default) pick up the rest.
//
// Operator-tune: bump for very-large-fleet deploys where 1000
// per 30s isn't enough throughput. Values ≤ 0 fall back to 1000
// rather than the legacy unlimited semantics — fail-safe.
//
// Setting: CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT environment variable.
JobClaimLimit int
// AgentHealthCheckInterval is how often the scheduler checks agent heartbeats.
// Default: 2 minutes. Minimum: 1 second. Marks agents offline if no recent heartbeat.
// Setting: CERTCTL_SCHEDULER_AGENT_HEALTH_CHECK_INTERVAL environment variable.