mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 15:01:32 +00:00
8191b1ee64
Phase 6 of the certctl architecture diligence remediation. Five
findings across the same scheduler-and-DB-pool surface.
SCALE-M1 (Med) — DB pool default bumped 25 → 50
internal/config/config.go line 1972:
MaxConnections: getEnvInt("CERTCTL_DATABASE_MAX_CONNS", 50)
Postgres default max_connections is 100; 50 leaves headroom for
pg_dump + ad-hoc psql + a server replica without exhausting the
DB-side cap. Operator override env var unchanged. Operator-tune
ladder for larger fleets (5K / 50K certs) lives in
docs/operator/scale.md as starter values pending Phase 8 load
tests — explicitly marked TBD.
SCALE-M3 (Med) — async-CA poll budget operator-configurable
This was already partially shipped: all 4 async-CA connectors
(digicert, entrust, globalsign, sectigo) have a per-connector
CERTCTL_<NAME>_POLL_MAX_WAIT_SECONDS setting (Audit fix #5,
closed pre-Phase-6). What was missing was a global package-default
override. Shipped:
- internal/connector/issuer/asyncpoll/asyncpoll.go gains
SetDefaultMaxWait(d) + effectiveDefaultMaxWait var + the
currentDefaultMaxWait() priority resolver.
- cmd/server/main.go reads CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS
at boot and calls SetDefaultMaxWait.
- deploy/ENVIRONMENTS.md documents the new env var (G-3 guard
green).
Naming deviation from the prompt's CERTCTL_ASYNC_POLL_MAX_ATTEMPTS:
the live code tracks wall-clock time (MaxWait), not attempt count.
Matched the existing per-connector nomenclature (_POLL_MAX_WAIT_SECONDS)
so the priority chain reads naturally.
SCALE-M5 (Med) — JitteredTicker wrapper for all 15 scheduler loops
internal/scheduler/jitter.go ships NewJitteredTicker(interval,
jitterPct) + DefaultSchedulerJitter (±10%). All 15 sites in
internal/scheduler/scheduler.go migrated from bare time.NewTicker
to NewJitteredTicker(interval, DefaultSchedulerJitter). Base
intervals unchanged; only the per-tick envelope adds ±10%
randomized delay so multiple loops with the same nominal cadence
don't co-fire and spike CPU + DB at wall-clock boundaries.
internal/scheduler/jitter_test.go pins:
- Bounded envelope (each tick within ±jitterPct of interval)
- Mean drift < 30% of nominal (sign-bug detector)
- Stop() releases the goroutine + closes C
- Stop() idempotent (no panic on repeat)
- Zero-jitter behaves like time.NewTicker
- Negative and >=1 jitterPct values clamped defensively
CI guard scripts/ci-guards/no-bare-newticker-in-scheduler.sh blocks
any future bare time.NewTicker in scheduler.go.
SCALE-L1 (Low) — renewal-sweep semaphore behavior documented
docs/operator/scale.md "Scheduler tick budgets" section explains
the per-tick concurrency semaphore (CERTCTL_RENEWAL_CONCURRENCY=25
default), the ctx-cancellation drain on tick-budget overrun, and
operator tuning advice (raise concurrency + DB pool together).
No code change — the behavior is defensible as-is per the audit.
SCALE-L2 (Low) — ETag middleware for top-5 read endpoints
internal/api/middleware/etag.go computes SHA-256 ETag over the
buffered response body, respects If-None-Match, short-circuits
to 304 Not Modified on match. GET/HEAD only; non-2xx responses
pass through unchanged. 64 KiB buffer cap degrades gracefully on
oversized responses (no caching, body still flushes intact).
Wired around the top-5 read endpoints via etagged() helper in
internal/api/router/router.go:
GET /api/v1/certificates
GET /api/v1/agents
GET /api/v1/jobs
GET /api/v1/audit
GET /api/v1/discovered-certificates
internal/api/middleware/etag_test.go pins 11 behaviors including
304-on-repeat, 200-after-mutation-with-new-ETag, POST bypass,
4xx/5xx pass-through, oversized-response degradation, wildcard
match, HEAD-treated-like-GET, byte-equal pass-through.
Cross-cutting fixes:
- internal/config/config_test.go::TestLoad_DefaultValues updated
to assert the new 50 default (was 25).
- deploy/helm/certctl/values.yaml comment corrected — agent
pollInterval is hardcoded 30s, not env-configurable; the
Phase 4 comment mistakenly referenced CERTCTL_AGENT_POLL_INTERVAL
which G-3 caught as a phantom env var.
- asyncpoll.go reformatted by gofmt; functionally unchanged.
Verification (all pass):
grep -nE 'SetMaxOpenConns' internal/repository/postgres/db.go # finds 1 site
grep -nE 'CERTCTL_DATABASE_MAX_CONNS.*50' internal/config/config.go # config default is 50
grep -rnE 'CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS' internal/ deploy/ENVIRONMENTS.md # wired
grep -cE 'time\.NewTicker\(' internal/scheduler/scheduler.go # 0 (all migrated)
grep -cE 'JitteredTicker' internal/scheduler/scheduler.go # 15
ls internal/scheduler/jitter.go internal/api/middleware/etag.go # both exist
ls docs/operator/scale.md # exists
bash scripts/ci-guards/no-bare-newticker-in-scheduler.sh # clean
bash scripts/ci-guards/G-3-env-docs-drift.sh # clean
go test ./internal/scheduler/ ./internal/api/middleware/ \
./internal/connector/issuer/asyncpoll/ ./internal/config/ # 4/4 packages green
Closes: cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M1
cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M3
cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M5
cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L1
cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L2
123 lines
3.9 KiB
Go
// Copyright 2026 certctl LLC. All rights reserved.
// SPDX-License-Identifier: BUSL-1.1

package scheduler

import (
	"math/rand/v2"
	"time"
)

// Phase 6 SCALE-M5 closure (2026-05-14): bounded-jitter wrapper
// around time.Timer to spread scheduler-loop tick co-fires.
//
// Pre-Phase-6 the 15 scheduler loops in scheduler.go each used a
// bare time.NewTicker(interval). When multiple loops share a
// nominal cadence (e.g. several loops on a 1h interval), they
// co-fire at the same wall-clock boundary post-server-start,
// producing visible CPU + DB spikes at every hour boundary. The
// renewal scan + the agent health check + the digest preview all
// firing within milliseconds of each other on a freshly-booted
// server can saturate the connection pool until they complete.
//
// JitteredTicker replaces the bare time.NewTicker with a goroutine
// that fires C once per interval ± jitterPct, drawn fresh on every
// tick. The base interval is the same as before; only the per-tick
// envelope changes. This preserves every loop's expected SLO (a
// renewal scan still runs ~once per hour) while breaking up the
// co-fire pattern.
//
// JitteredTicker.Stop() must be called by the caller (typically via
// defer) to release the goroutine. After Stop, the C channel is
// closed.
type JitteredTicker struct {
	// C is the channel a tick fires on. Read this in the loop's
	// select{} the same way you'd read time.Ticker.C.
	C chan time.Time

	stopCh chan struct{}
}

// NewJitteredTicker returns a ticker that fires on C every
// interval ± jitterPct (e.g. jitterPct=0.1 = ±10%). The first tick
// arrives one (jittered) interval after construction, same as
// time.NewTicker. jitterPct < 0 is treated as 0 (no jitter, equivalent
// to time.NewTicker). jitterPct ≥ 1 is clamped to 0.99 (avoid the
// degenerate "instant tick" case where the jitter consumes the
// entire interval).
//
// interval must be > 0. Unlike time.NewTicker, this constructor does
// not panic on a zero or negative interval: time.NewTimer accepts any
// duration, and the 1ns floor in run turns a non-positive interval
// into a near-continuous tick loop. Callers must validate interval.
func NewJitteredTicker(interval time.Duration, jitterPct float64) *JitteredTicker {
	if jitterPct < 0 {
		jitterPct = 0
	}
	if jitterPct >= 1 {
		jitterPct = 0.99
	}

	jt := &JitteredTicker{
		C:      make(chan time.Time, 1),
		stopCh: make(chan struct{}),
	}

	go jt.run(interval, jitterPct)
	return jt
}

// run owns the per-tick scheduling loop. The fresh-per-tick jitter
// draw prevents drift from compounding (vs. computing the jittered
// interval once and reusing it).
func (jt *JitteredTicker) run(interval time.Duration, jitterPct float64) {
	defer close(jt.C)

	for {
		// Bounded-symmetric jitter around the interval. delta ∈
		// [-jitterPct, +jitterPct) drawn fresh per tick.
		delta := (rand.Float64()*2 - 1) * jitterPct
		next := time.Duration(float64(interval) * (1 + delta))
		// Floor at 1ns so we never feed a zero or negative
		// duration into time.NewTimer; the jitterPct clamp above
		// keeps next > 0 in normal use but a Float64 rounding
		// edge case could otherwise produce 0.
		if next < time.Nanosecond {
			next = time.Nanosecond
		}

		timer := time.NewTimer(next)
		select {
		case t := <-timer.C:
			select {
			case jt.C <- t:
				// emitted
			case <-jt.stopCh:
				return
			}
		case <-jt.stopCh:
			if !timer.Stop() {
				<-timer.C
			}
			return
		}
	}
}

// Stop releases the goroutine + closes C. Safe to call multiple
// times from one goroutine; subsequent calls are no-ops (the stopCh
// close is the only side effect, and re-closing a closed channel
// would panic, so we guard via a select+default). The guard is not
// atomic: concurrent Stop calls from different goroutines can still
// race to close stopCh.
func (jt *JitteredTicker) Stop() {
	select {
	case <-jt.stopCh:
		// already closed; no-op
	default:
		close(jt.stopCh)
	}
}

// DefaultSchedulerJitter is the jitter percentage applied to every
// scheduler-loop tick. ±10% is a conventional "spread but don't
// blur SLO" envelope: wide enough to break up co-fires, narrow
// enough that each loop keeps its nominal cadence.
const DefaultSchedulerJitter = 0.10