mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 15:01:32 +00:00
fix(scheduler): SCALE-001 — cap ClaimPendingJobs per-tick (default 1000)
Sprint 2 unified-master-audit closure. Pre-fix, the scheduler invoked
ClaimPendingJobs(ctx, "", 0). A limit of 0 loads every Pending row in a
single transaction, so a 100K-job burst (cert-fleet sweep, post-outage
recovery, large agent-fleet first boot) marshalled the full queue
into process memory before boundedFanOut's semaphore could
back-pressure the upstream CAs.
Fix:
- SchedulerConfig.JobClaimLimit (env CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT,
default 1000). Values ≤0 are normalised to 1000 in SetClaimLimit, a
fail-safe against the legacy unlimited semantics.
- JobService.claimLimit threaded into the existing
ProcessPendingJobs flow; ClaimPendingJobs(ctx, "", s.claimLimit).
- cmd/server/main.go wires jobService.SetClaimLimit(cfg.Scheduler.JobClaimLimit).
- 'processing pending jobs' log line now includes claim_limit so
operators can spot the cap engaging (count == claim_limit ⇒
queue is running ahead of fan-out; bump CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT
or CERTCTL_RENEWAL_CONCURRENCY).
- Test wiring keeps the legacy zero-value (unlimited) for byte-
for-byte compatibility with the existing 600+ JobService unit
tests — only production code goes through SetClaimLimit.
Regression coverage:
- mockJobRepo.LastClaimLimit records the limit passed through
ClaimPendingJobs so tests can pin the propagation.
- TestProcessPendingJobs_RespectsClaimLimit: 10 Pending rows,
SetClaimLimit(3), expect exactly 3 transition to Running plus
LastClaimLimit=3 on the mock.
- TestSetClaimLimit_NormalisesNonPositive: 0/-1/-1000 all
normalise to 1000.
Closes SCALE-001.
@@ -207,6 +207,10 @@ type mockJobRepo struct {
 	ListTimedOutErr error
 	ListOfflineAgentJobsErr error
 	Updated []*domain.Job
+	// SCALE-001 closure (Sprint 2): records the most-recent `limit`
+	// passed to ClaimPendingJobs so tests can pin the per-tick cap
+	// propagation from JobService.SetClaimLimit.
+	LastClaimLimit int
 }
 
 func (m *mockJobRepo) List(ctx context.Context) ([]*domain.Job, error) {
@@ -352,9 +356,13 @@ func (m *mockJobRepo) ListPendingByAgentID(ctx context.Context, agentID string)
 // ClaimPendingJobs simulates the H-6 atomic claim semantics: matching rows are transitioned
 // Pending → Running before being returned. The in-memory mock has no concurrency primitives
 // beyond the existing mutex, which is sufficient for single-goroutine service tests.
+//
+// LastClaimLimit is recorded for SCALE-001 (Sprint 2) tests that pin the
+// per-tick cap propagation from JobService.SetClaimLimit.
 func (m *mockJobRepo) ClaimPendingJobs(ctx context.Context, jobType domain.JobType, limit int) ([]*domain.Job, error) {
 	m.mu.Lock()
 	defer m.mu.Unlock()
+	m.LastClaimLimit = limit
 	if m.ListErr != nil {
 		return nil, m.ListErr
 	}