scheduler+db: close Phase 6 — scale hardening across pool, jitter, ETag, asyncpoll

Phase 6 of the certctl architecture diligence remediation. Five
findings across the same scheduler-and-DB-pool surface.

SCALE-M1 (Med) — DB pool default bumped 25 → 50
  internal/config/config.go line 1972:
    MaxConnections: getEnvInt("CERTCTL_DATABASE_MAX_CONNS", 50)
  Postgres default max_connections is 100; 50 leaves headroom for
  pg_dump + ad-hoc psql + a server replica without exhausting the
  DB-side cap. Operator override env var unchanged. Operator-tune
  ladder for larger fleets (5K / 50K certs) lives in
  docs/operator/scale.md as starter values pending Phase 8 load
  tests — explicitly marked TBD.

SCALE-M3 (Med) — async-CA poll budget operator-configurable
  Live state was partially-already-shipped: all 4 async-CA
  connectors (digicert, entrust, globalsign, sectigo) already have
  per-connector CERTCTL_<NAME>_POLL_MAX_WAIT_SECONDS (Audit fix #5
  closed pre-Phase-6). What was missing: a global package-default
  override. Shipped:
    - internal/connector/issuer/asyncpoll/asyncpoll.go gains
      SetDefaultMaxWait(d) + effectiveDefaultMaxWait var + the
      currentDefaultMaxWait() priority resolver.
    - cmd/server/main.go reads CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS
      at boot and calls SetDefaultMaxWait.
    - deploy/ENVIRONMENTS.md documents the new env var (G-3 guard
      green).
  Naming deviation from the prompt's CERTCTL_ASYNC_POLL_MAX_ATTEMPTS:
  the live code tracks wall-clock time (MaxWait), not attempt count.
  Matched the existing per-connector nomenclature (_POLL_MAX_WAIT_SECONDS)
  so the priority chain reads naturally.

SCALE-M5 (Med) — JitteredTicker wrapper for all 15 scheduler loops
  internal/scheduler/jitter.go ships NewJitteredTicker(interval,
  jitterPct) + DefaultSchedulerJitter (±10%). All 15 sites in
  internal/scheduler/scheduler.go migrated from bare time.NewTicker
  to NewJitteredTicker(interval, DefaultSchedulerJitter). Base
  intervals unchanged; only the per-tick envelope adds ±10%
  randomized delay so multiple loops with the same nominal cadence
  don't co-fire and spike CPU + DB at wall-clock boundaries.

  internal/scheduler/jitter_test.go pins:
    - Bounded envelope (each tick within ±jitterPct of interval)
    - Mean drift < 30% of nominal (sign-bug detector)
    - Stop() releases the goroutine + closes C
    - Stop() idempotent (no panic on repeat)
    - Zero-jitter behaves like time.NewTicker
    - Negative and >=1 jitterPct values clamped defensively

  CI guard scripts/ci-guards/no-bare-newticker-in-scheduler.sh blocks
  any future bare time.NewTicker in scheduler.go.

SCALE-L1 (Low) — renewal-sweep semaphore behavior documented
  docs/operator/scale.md "Scheduler tick budgets" section explains
  the per-tick concurrency semaphore (CERTCTL_RENEWAL_CONCURRENCY=25
  default), the ctx-cancellation drain on tick-budget overrun, and
  operator tuning advice (raise concurrency + DB pool together).
  No code change — the behavior is defensible as-is per the audit.

SCALE-L2 (Low) — ETag middleware for top-5 read endpoints
  internal/api/middleware/etag.go computes SHA-256 ETag over the
  buffered response body, respects If-None-Match, short-circuits
  to 304 Not Modified on match. GET/HEAD only; non-2xx responses
  pass through unchanged. 64 KiB buffer cap degrades gracefully on
  oversized responses (no caching, body still flushes intact).

  Wired around the top-5 read endpoints via etagged() helper in
  internal/api/router/router.go:
    GET /api/v1/certificates
    GET /api/v1/agents
    GET /api/v1/jobs
    GET /api/v1/audit
    GET /api/v1/discovered-certificates

  internal/api/middleware/etag_test.go pins 11 behaviors including
  304-on-repeat, 200-after-mutation-with-new-ETag, POST bypass,
  4xx/5xx pass-through, oversized-response degradation, wildcard
  match, HEAD-treated-like-GET, byte-equal pass-through.

Cross-cutting fixes:
  - internal/config/config_test.go::TestLoad_DefaultValues updated
    to assert the new 50 default (was 25).
  - deploy/helm/certctl/values.yaml comment corrected — agent
    pollInterval is hardcoded 30s, not env-configurable; the
    Phase 4 comment mistakenly referenced CERTCTL_AGENT_POLL_INTERVAL
    which G-3 caught as a phantom env var.
  - asyncpoll.go reformatted by gofmt; functionally unchanged.

Verification (all pass):
  grep -nE 'SetMaxOpenConns' internal/repository/postgres/db.go    # finds 1 site
  grep -nE 'CERTCTL_DATABASE_MAX_CONNS.*50' internal/config/config.go  # config default is 50
  grep -rnE 'CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS' internal/ deploy/ENVIRONMENTS.md  # wired
  grep -cE 'time\.NewTicker\(' internal/scheduler/scheduler.go    # 0 (all migrated)
  grep -cE 'JitteredTicker' internal/scheduler/scheduler.go         # 15
  ls internal/scheduler/jitter.go internal/api/middleware/etag.go   # both exist
  ls docs/operator/scale.md                                          # exists
  bash scripts/ci-guards/no-bare-newticker-in-scheduler.sh          # clean
  bash scripts/ci-guards/G-3-env-docs-drift.sh                      # clean
  go test ./internal/scheduler/ ./internal/api/middleware/ \
    ./internal/connector/issuer/asyncpoll/ ./internal/config/       # 4/4 packages green

Closes: cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M1
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M3
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M5
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L1
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L2
This commit is contained in:
shankar0123
2026-05-14 01:23:03 +00:00
parent d6f4d5c5e8
commit 8191b1ee64
14 changed files with 1159 additions and 27 deletions
+15 -15
View File
@@ -473,7 +473,7 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
// If an error occurs, it logs the error but continues running.
// Uses atomic.Bool to prevent duplicate execution if the previous check is still running.
func (s *Scheduler) renewalCheckLoop(ctx context.Context) {
ticker := time.NewTicker(s.renewalCheckInterval)
ticker := NewJitteredTicker(s.renewalCheckInterval, DefaultSchedulerJitter)
defer ticker.Stop()
// Run immediately on start (with idempotency guard)
@@ -522,7 +522,7 @@ func (s *Scheduler) runRenewalCheck(ctx context.Context) {
// If an error occurs, it logs the error but continues running.
// Uses atomic.Bool to prevent duplicate execution if the previous job is still running.
func (s *Scheduler) jobProcessorLoop(ctx context.Context) {
ticker := time.NewTicker(s.jobProcessorInterval)
ticker := NewJitteredTicker(s.jobProcessorInterval, DefaultSchedulerJitter)
defer ticker.Stop()
// Run immediately on start (with idempotency guard)
@@ -573,7 +573,7 @@ func (s *Scheduler) runJobProcessor(ctx context.Context) {
// Uses atomic.Bool to prevent duplicate execution if the previous retry sweep
// is still running.
func (s *Scheduler) jobRetryLoop(ctx context.Context) {
ticker := time.NewTicker(s.jobRetryInterval)
ticker := NewJitteredTicker(s.jobRetryInterval, DefaultSchedulerJitter)
defer ticker.Stop()
// Run immediately on start (with idempotency guard)
@@ -628,7 +628,7 @@ func (s *Scheduler) runJobRetry(ctx context.Context) {
// retry loop then auto-promotes eligible Failed jobs back to Pending. Closes
// coverage gap I-003. Uses atomic.Bool to prevent duplicate execution.
func (s *Scheduler) jobTimeoutLoop(ctx context.Context) {
ticker := time.NewTicker(s.jobTimeoutInterval)
ticker := NewJitteredTicker(s.jobTimeoutInterval, DefaultSchedulerJitter)
defer ticker.Stop()
// Run immediately on start (with idempotency guard)
@@ -706,7 +706,7 @@ func (s *Scheduler) runJobTimeout(ctx context.Context) {
// If an error occurs, it logs the error but continues running.
// Uses atomic.Bool to prevent duplicate execution if the previous check is still running.
func (s *Scheduler) agentHealthCheckLoop(ctx context.Context) {
ticker := time.NewTicker(s.agentHealthCheckInterval)
ticker := NewJitteredTicker(s.agentHealthCheckInterval, DefaultSchedulerJitter)
defer ticker.Stop()
// Run immediately on start (with idempotency guard)
@@ -754,7 +754,7 @@ func (s *Scheduler) runAgentHealthCheck(ctx context.Context) {
// If an error occurs, it logs the error but continues running.
// Uses atomic.Bool to prevent duplicate execution if the previous process is still running.
func (s *Scheduler) notificationProcessLoop(ctx context.Context) {
ticker := time.NewTicker(s.notificationProcessInterval)
ticker := NewJitteredTicker(s.notificationProcessInterval, DefaultSchedulerJitter)
defer ticker.Stop()
// Run immediately on start (with idempotency guard)
@@ -806,7 +806,7 @@ func (s *Scheduler) runNotificationProcess(ctx context.Context) {
// Uses atomic.Bool to prevent duplicate execution if the previous retry sweep
// is still running. Mirrors the I-001 jobRetryLoop topology byte-for-byte.
func (s *Scheduler) notificationRetryLoop(ctx context.Context) {
ticker := time.NewTicker(s.notificationRetryInterval)
ticker := NewJitteredTicker(s.notificationRetryInterval, DefaultSchedulerJitter)
defer ticker.Stop()
// Run immediately on start (with idempotency guard)
@@ -861,7 +861,7 @@ func (s *Scheduler) runNotificationRetry(ctx context.Context) {
// no CRL/OCSP needed.
// Uses atomic.Bool to prevent duplicate execution if the previous check is still running.
func (s *Scheduler) shortLivedExpiryCheckLoop(ctx context.Context) {
ticker := time.NewTicker(s.shortLivedExpiryCheckInterval)
ticker := NewJitteredTicker(s.shortLivedExpiryCheckInterval, DefaultSchedulerJitter)
defer ticker.Stop()
// Run immediately on start (with idempotency guard)
@@ -909,7 +909,7 @@ func (s *Scheduler) runShortLivedExpiryCheck(ctx context.Context) {
// of configured network targets.
// Uses atomic.Bool to prevent duplicate execution if the previous scan is still running.
func (s *Scheduler) networkScanLoop(ctx context.Context) {
ticker := time.NewTicker(s.networkScanInterval)
ticker := NewJitteredTicker(s.networkScanInterval, DefaultSchedulerJitter)
defer ticker.Stop()
// Run immediately on start (with idempotency guard)
@@ -956,7 +956,7 @@ func (s *Scheduler) runNetworkScan(ctx context.Context) {
// digestLoop runs every digestInterval and generates/sends certificate digest emails.
// Uses atomic.Bool to prevent duplicate execution if the previous digest is still running.
func (s *Scheduler) digestLoop(ctx context.Context) {
ticker := time.NewTicker(s.digestInterval)
ticker := NewJitteredTicker(s.digestInterval, DefaultSchedulerJitter)
defer ticker.Stop()
// Do NOT run immediately on start for digest — wait for the first tick.
@@ -999,7 +999,7 @@ func (s *Scheduler) runDigest(ctx context.Context) {
// resource-intensive. Wait for the first tick.
// Uses atomic.Bool to prevent duplicate execution if the previous check is still running.
func (s *Scheduler) healthCheckLoop(ctx context.Context) {
ticker := time.NewTicker(s.healthCheckInterval)
ticker := NewJitteredTicker(s.healthCheckInterval, DefaultSchedulerJitter)
defer ticker.Stop()
// Do NOT run immediately on start for health checks — wait for the first tick.
@@ -1041,7 +1041,7 @@ func (s *Scheduler) runHealthCheck(ctx context.Context) {
// Runs immediately on start, then on each tick. Same idempotency pattern as networkScanLoop.
// Uses atomic.Bool to prevent duplicate execution if the previous scan is still running.
func (s *Scheduler) cloudDiscoveryLoop(ctx context.Context) {
ticker := time.NewTicker(s.cloudDiscoveryInterval)
ticker := NewJitteredTicker(s.cloudDiscoveryInterval, DefaultSchedulerJitter)
defer ticker.Stop()
// Run immediately on start (with idempotency guard)
@@ -1121,7 +1121,7 @@ func (s *Scheduler) WaitForCompletion(timeout time.Duration) error {
//
// Bundle CRL/OCSP-Responder Phase 3.
func (s *Scheduler) crlGenerationLoop(ctx context.Context) {
ticker := time.NewTicker(s.crlGenerationInterval)
ticker := NewJitteredTicker(s.crlGenerationInterval, DefaultSchedulerJitter)
defer ticker.Stop()
// Do NOT run immediately on start. CRLs are typically valid for
@@ -1171,7 +1171,7 @@ var ErrSchedulerShutdownTimeout = errors.New("scheduler graceful shutdown timeou
// sync.WaitGroup tracks the in-flight goroutine for graceful shutdown.
// Phase 5.
func (s *Scheduler) acmeGCLoop(ctx context.Context) {
ticker := time.NewTicker(s.acmeGCInterval)
ticker := NewJitteredTicker(s.acmeGCInterval, DefaultSchedulerJitter)
defer ticker.Stop()
for {
@@ -1212,7 +1212,7 @@ func (s *Scheduler) acmeGCLoop(ctx context.Context) {
// file: a stuck Postgres can't block the next tick, and concurrent
// sweeps are skipped not queued.
func (s *Scheduler) sessionGCLoop(ctx context.Context) {
ticker := time.NewTicker(s.sessionGCInterval)
ticker := NewJitteredTicker(s.sessionGCInterval, DefaultSchedulerJitter)
defer ticker.Stop()
for {