mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 15:01:32 +00:00
fix(scheduler): SCALE-001 — cap ClaimPendingJobs per-tick (default 1000)
Sprint 2 unified-master-audit closure. Before this fix the scheduler
invoked ClaimPendingJobs(ctx, "", 0). A limit of 0 loads every Pending
row in a single transaction, so a 100K-job burst (cert-fleet sweep,
post-outage recovery, large agent-fleet first boot) marshalled the full
queue into process memory before boundedFanOut's semaphore could
back-pressure the upstream CAs.
Fix:
- SchedulerConfig.JobClaimLimit (env CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT,
  default 1000). Values ≤0 are normalised to 1000 in SetClaimLimit, a
  fail-safe against the legacy unlimited semantics.
- JobService.claimLimit is threaded into the existing
  ProcessPendingJobs flow, which now calls
  ClaimPendingJobs(ctx, "", s.claimLimit).
- cmd/server/main.go wires jobService.SetClaimLimit(cfg.Scheduler.JobClaimLimit).
- The 'processing pending jobs' log line now includes claim_limit so
  operators can spot the cap engaging (count == claim_limit ⇒ the
  queue is running ahead of fan-out; bump
  CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT or CERTCTL_RENEWAL_CONCURRENCY).
- Test wiring keeps the legacy zero value (unlimited) for byte-for-byte
  compatibility with the existing 600+ JobService unit tests; only
  production code goes through SetClaimLimit.
Regression coverage:
- mockJobRepo.LastClaimLimit records the limit passed through
  ClaimPendingJobs so tests can pin the propagation.
- TestProcessPendingJobs_RespectsClaimLimit: 10 Pending rows,
  SetClaimLimit(3), expect exactly 3 to transition to Running plus
  LastClaimLimit=3 on the mock.
- TestSetClaimLimit_NormalisesNonPositive: 0/-1/-1000 all
  normalise to 1000.
Closes SCALE-001.
@@ -170,6 +170,26 @@ type SchedulerConfig struct {
 	// Setting: CERTCTL_RENEWAL_CONCURRENCY environment variable.
 	RenewalConcurrency int
 
+	// JobClaimLimit caps the number of Pending rows a single
+	// scheduler tick may claim via repository.JobRepository.ClaimPendingJobs.
+	// Default 1000.
+	//
+	// SCALE-001 closure (Sprint 2, 2026-05-16). Pre-fix the scheduler
+	// invoked ClaimPendingJobs with limit:0, which loads every Pending
+	// row in a single transaction. A 100K-job burst (cert-fleet sweep,
+	// post-outage recovery, etc.) would marshal the full queue into
+	// process memory before boundedFanOut's semaphore could back-
+	// pressure the upstream CAs. Capping the claim per tick keeps
+	// memory bounded; the next tick (JobProcessorInterval=30s default)
+	// picks up the rest.
+	//
+	// Operator-tune: bump for very-large-fleet deploys where 1000
+	// per 30s isn't enough throughput. Values ≤ 0 fall back to 1000
+	// rather than the legacy unlimited semantics — fail-safe.
+	//
+	// Setting: CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT environment variable.
+	JobClaimLimit int
+
 	// AgentHealthCheckInterval is how often the scheduler checks agent heartbeats.
 	// Default: 2 minutes. Minimum: 1 second. Marks agents offline if no recent heartbeat.
 	// Setting: CERTCTL_SCHEDULER_AGENT_HEALTH_CHECK_INTERVAL environment variable.
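
To make the "1000 per 30s" throughput remark concrete, the following sketch reads the limit from CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT and estimates how many ticks a fixed backlog needs. It is an illustration under stated assumptions: parseJobClaimLimit and drainTicks are hypothetical helpers, the fallback to 1000 on an empty or malformed value is assumed rather than taken from the repository's config loader, and the estimate assumes every tick claims a full batch while no new jobs arrive.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// defaultJobClaimLimit mirrors the documented default of 1000.
const defaultJobClaimLimit = 1000

// parseJobClaimLimit converts the raw environment value into a limit.
// Falling back to the default on empty, malformed, or non-positive
// input is an assumption of this sketch.
func parseJobClaimLimit(raw string) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return defaultJobClaimLimit
	}
	return n
}

// drainTicks returns how many scheduler ticks are needed to claim a
// fixed backlog when each tick claims at most limit rows.
func drainTicks(backlog, limit int) int {
	if backlog <= 0 {
		return 0
	}
	if limit <= 0 {
		limit = defaultJobClaimLimit
	}
	return (backlog + limit - 1) / limit // ceiling division
}

func main() {
	limit := parseJobClaimLimit(os.Getenv("CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT"))
	tick := 30 * time.Second // JobProcessorInterval default per the comment above

	ticks := drainTicks(100000, limit)
	fmt.Println("claim limit:", limit)
	fmt.Println("ticks to claim 100000 jobs:", ticks)
	fmt.Println("approximate wall time:", time.Duration(ticks)*tick)
}
```

With the default limit of 1000 and a 30s tick, a 100K-job backlog needs 100 ticks, or roughly 50 minutes, which is the kind of deployment where raising the limit may be worth considering.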