scheduler: bound renewal concurrency via CERTCTL_RENEWAL_CONCURRENCY

Closes the #9 acquisition-readiness blocker from the 2026-05-01 issuer
coverage audit. Pre-fix, JobService.ProcessPendingJobs ran every
claimed job sequentially in a single goroutine: safe but slow, and
operators with large fleets had no lever to dial throughput up.
Switching to fire-and-forget per-job goroutines would have unbounded
the upstream-CA call rate and tripped DigiCert / Entrust / Sectigo
rate limits — certctl's response to 429 was to retry on the next
tick, re-fanning out the same calls and digging deeper into the
limit. Operators need a knob.

This commit:

- Adds CERTCTL_RENEWAL_CONCURRENCY env var (default 25) loaded via
  the existing getEnvInt pattern in internal/config/config.go.
  Documented inline as the cap for the per-tick renewal/issuance/
  deployment goroutine fan-out, with operator-tuning guidance:
  permissive upstream limits + large fleets (>10k certs) → 100;
  strict limits or async-CA-heavy fleets → 25 or lower.

- Wires golang.org/x/sync/semaphore.Weighted around the per-job
  goroutine launch in JobService.ProcessPendingJobs. Acquire(ctx, 1)
  is the load-bearing piece — it BLOCKS the loop when at the cap,
  providing real backpressure rather than fire-and-forget. The
  fan-out is split into processPendingJobsSequential (legacy,
  preserved for unit-test wiring that doesn't call
  SetRenewalConcurrency) and processPendingJobsConcurrent (production,
  delegates to a generic boundedFanOut helper).

- boundedFanOut takes the per-job work as a closure so the cap can
  be tested directly without standing up the renewal/deployment
  service graph. processed/failed counters use atomic.Int64 to
  avoid mutex overhead on every job completion; final log line
  reads both AFTER wg.Wait so the counts reflect every dispatched
  job. ctx-aware Acquire ensures a shutdown ctx cancel interrupts
  the dispatch loop promptly; in-flight goroutines drain via Wait
  before the function returns so no goroutine outlives the
  scheduler tick.

- shouldSkipJob extracted as a package-private helper so the
  agent-routed-deployment skip logic is shared between the
  sequential and concurrent paths byte-for-byte (the audit prompt's
  "channel-based semaphore without ctx-aware acquire" anti-pattern
  is explicitly avoided — semaphore.Weighted.Acquire returns on ctx
  done; channel <- struct{}{} would block forever).

- SetRenewalConcurrency setter on JobService normalises ≤0 to 1.
  semaphore.NewWeighted(0) constructs a semaphore that blocks every
  Acquire forever; the normalisation prevents a misconfigured env
  var from wedging the scheduler.

- cmd/server/main.go wires SetRenewalConcurrency(cfg.Scheduler.
  RenewalConcurrency) on the freshly-built jobService, immediately
  after SetAuditService. Production deployments always take the
  bounded path; tests that build JobService directly via
  NewJobService keep their strict-sequential behaviour because
  renewalConcurrency is the zero value.

- Tests in internal/service/job_concurrency_test.go:
  * TestBoundedFanOut_CapHolds — primary regression guard. 50 jobs
    × 50ms work × cap=5 → asserts peak in-flight never exceeds 5
    AND reaches 5 at least once (catches both upper-bound regressions
    and gates that incorrectly cap below the configured value).
    Lock-free max via CompareAndSwap so the measurement instrument
    doesn't itself constrain concurrency.
  * TestBoundedFanOut_AllJobsRun — lower-bound: every non-skipped
    job is dispatched.
  * TestBoundedFanOut_SkipsAgentRoutedDeployments — pins the
    shouldSkipJob contract.
  * TestBoundedFanOut_CtxCancelInterrupts — ctx cancellation
    interrupts a stuck fan-out within the timeout budget.
  * TestBoundedFanOut_FailedJobsCounted — per-job errors don't
    abort the fan-out.
  * TestSetRenewalConcurrency_NormalizesNonPositive — ≤0 → 1 fail-safe
    pinned across negative/zero/positive inputs.

- docs/features.md: scheduler-loop table augmented with the
  concurrency-cap env-var pointer alongside the job-processor row.

- docs/architecture.md: Concurrency Safety section gains a paragraph
  explaining the cap, the operator-tuning guidance, the ctx-aware
  Acquire semantics, and the audit reference.

Operator-facing impact: the first big renewal sweep no longer
takes down the upstream CA's rate-limit budget. Existing deployments
get the bounded path automatically (default 25); operators can
override via env var without code changes.

Verified locally:
- gofmt -l . clean
- go vet ./... clean
- staticcheck ./... clean
- go test -short -count=1 across service / scheduler / config /
  integration: green
- Six new tests under TestBoundedFanOut* + TestSetRenewalConcurrency*:
  green

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #9.
This commit is contained in:
shankar0123
2026-05-02 14:12:30 +00:00
parent 3a665ae6ba
commit 35e18bfc56
8 changed files with 427 additions and 9 deletions
+141 -6
View File
@@ -6,8 +6,12 @@ import (
"fmt"
"log/slog"
"strings"
"sync"
"sync/atomic"
"time"
"golang.org/x/sync/semaphore"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/repository"
)
@@ -29,6 +33,14 @@ type JobService struct {
deploymentService *DeploymentService
auditService *AuditService
logger *slog.Logger
// renewalConcurrency caps the number of concurrent goroutines that
// ProcessPendingJobs spawns. 0 (zero-value) means "sequential" so
// existing test wiring that constructs JobService directly via
// NewJobService keeps its strict-serial behaviour. Production
// wiring calls SetRenewalConcurrency(cfg.Scheduler.RenewalConcurrency)
// to switch on the bounded fan-out. Audit fix #9.
renewalConcurrency int
}
// NewJobService creates a new job service.
@@ -56,6 +68,28 @@ func NewJobService(
}
}
// SetRenewalConcurrency wires the per-tick fan-out concurrency cap that
// ProcessPendingJobs uses. Called from cmd/server/main.go with
// cfg.Scheduler.RenewalConcurrency (default 25). Values ≤ 0 are
// normalised to 1 — fail-safe to sequential rather than panicking on
// semaphore.NewWeighted(0) which constructs a semaphore that blocks
// every Acquire.
//
// Audit fix #9: bounded scheduler concurrency. Pre-fix
// ProcessPendingJobs ran every claimed job sequentially in a single
// goroutine (slow but safe); operators with large fleets needed a
// performance lever, but switching to fire-and-forget per-job
// goroutines would have unbounded the upstream-CA call rate and
// tripped DigiCert / Entrust / Sectigo rate limits. The cap gives
// the operator a knob (CERTCTL_RENEWAL_CONCURRENCY) to dial
// throughput up without losing the rate-limit headroom.
func (s *JobService) SetRenewalConcurrency(n int) {
if n <= 0 {
n = 1
}
s.renewalConcurrency = n
}
// SetAuditService wires an optional audit service for emitting lifecycle
// events (e.g., scheduler-driven job_retry transitions recorded by
// RetryFailedJobs). Construction keeps the audit dependency optional so
@@ -88,21 +122,34 @@ func (s *JobService) ProcessPendingJobs(ctx context.Context) error {
s.logger.Info("processing pending jobs", "count", len(pendingJobs))
// Audit fix #9: bounded concurrent fan-out. When renewalConcurrency
// is the zero value (caller never set it), fall through to the
// historical strict-sequential loop so every existing test path
// keeps its byte-for-byte behaviour. Production wiring in
// cmd/server/main.go always calls SetRenewalConcurrency with the
// configured cap (default 25), so the bounded path is always taken
// in real deployments.
if s.renewalConcurrency <= 0 {
return s.processPendingJobsSequential(ctx, pendingJobs)
}
return s.processPendingJobsConcurrent(ctx, pendingJobs, s.renewalConcurrency)
}
// processPendingJobsSequential is the legacy strict-serial fan-out
// (preserved for unit-test wiring that constructs JobService via
// NewJobService and never calls SetRenewalConcurrency). One job at a
// time, no concurrency.
func (s *JobService) processPendingJobsSequential(ctx context.Context, pendingJobs []*domain.Job) error {
var processedCount int
var failedCount int
// Process each job
for _, job := range pendingJobs {
// Skip deployment jobs that have an agent_id — those are meant for agent
// pickup via GetPendingWork(), not server-side processing. The server should
// only process deployment jobs without an agent (legacy/serverless targets).
if job.Type == domain.JobTypeDeployment && job.AgentID != nil && *job.AgentID != "" {
if shouldSkipJob(job) {
s.logger.Debug("skipping agent-routed deployment job",
"job_id", job.ID,
"agent_id", *job.AgentID)
continue
}
if err := s.processJob(ctx, job); err != nil {
s.logger.Error("failed to process job",
"job_id", job.ID,
@@ -122,6 +169,94 @@ func (s *JobService) ProcessPendingJobs(ctx context.Context) error {
return nil
}
// processPendingJobsConcurrent is the production entry point for the
// bounded fan-out — it pins the per-job work function to s.processJob.
// Test code at internal/service/job_concurrency_test.go drives
// boundedFanOut directly with a counter-tracking stub so the
// concurrency cap can be asserted without standing up the full
// renewal/deployment service graph.
func (s *JobService) processPendingJobsConcurrent(ctx context.Context, pendingJobs []*domain.Job, capN int) error {
return boundedFanOut(ctx, pendingJobs, capN, s.processJob, s.logger)
}
// boundedFanOut runs `work` over `pendingJobs` with at most `capN`
// goroutines in flight at any moment, using
// golang.org/x/sync/semaphore.Weighted for the gate. The Acquire(ctx,
// 1) call BLOCKS the loop when at the cap, providing real backpressure
// rather than fire-and-forget — the scheduler tick can't outrun the
// upstream-CA rate limit. ctx cancellation propagates through Acquire
// so a context.Done() interrupts the fan-out promptly; in-flight
// goroutines drain via Wait before the function returns, so no
// goroutine outlives the scheduler tick.
//
// processed / failed are tracked via atomic.Int64 to avoid mutex
// overhead on every job completion; the final log line reads both
// after Wait, so the values reflect every dispatched job.
//
// Audit fix #9: the cap is a load-bearing pre-condition for "operators
// with permissive upstream limits and large fleets can scale up the
// renewal sweep without losing rate-limit headroom."
func boundedFanOut(ctx context.Context, pendingJobs []*domain.Job, capN int, work func(context.Context, *domain.Job) error, logger *slog.Logger) error {
var processed, failed atomic.Int64
sem := semaphore.NewWeighted(int64(capN))
var wg sync.WaitGroup
for _, job := range pendingJobs {
if shouldSkipJob(job) {
logger.Debug("skipping agent-routed deployment job",
"job_id", job.ID,
"agent_id", *job.AgentID)
continue
}
// Acquire BEFORE launching so the offered load to upstream
// CAs respects the cap regardless of how fast individual
// jobs complete.
if err := sem.Acquire(ctx, 1); err != nil {
dispatched := processed.Load() + failed.Load()
logger.Warn("renewal fan-out cancelled mid-tick",
"reason", err,
"jobs_dispatched", dispatched,
"jobs_remaining", int64(len(pendingJobs))-dispatched)
break
}
wg.Add(1)
go func(j *domain.Job) {
defer wg.Done()
defer sem.Release(1)
if err := work(ctx, j); err != nil {
logger.Error("failed to process job",
"job_id", j.ID,
"job_type", j.Type,
"error", err)
failed.Add(1)
return
}
processed.Add(1)
}(job)
}
wg.Wait()
logger.Info("job processing completed",
"processed", processed.Load(),
"failed", failed.Load(),
"total", len(pendingJobs),
"concurrency_cap", capN)
return nil
}
// shouldSkipJob returns true for deployment jobs that have an agent_id —
// those are meant for agent pickup via GetPendingWork(), not server-side
// processing. The server should only process deployment jobs without an
// agent (legacy/serverless targets).
func shouldSkipJob(job *domain.Job) bool {
return job.Type == domain.JobTypeDeployment && job.AgentID != nil && *job.AgentID != ""
}
// processJob routes a single job to the appropriate service based on type.
func (s *JobService) processJob(ctx context.Context, job *domain.Job) error {
s.logger.Debug("processing job",