mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-10 18:18:52 +00:00
scheduler+db: close Phase 6 — scale hardening across pool, jitter, ETag, asyncpoll
Phase 6 of the certctl architecture diligence remediation. Five
findings across the same scheduler-and-DB-pool surface.
SCALE-M1 (Med) — DB pool default bumped 25 → 50
internal/config/config.go line 1972:
MaxConnections: getEnvInt("CERTCTL_DATABASE_MAX_CONNS", 50)
Postgres default max_connections is 100; 50 leaves headroom for
pg_dump + ad-hoc psql + a server replica without exhausting the
DB-side cap. Operator override env var unchanged. Operator-tune
ladder for larger fleets (5K / 50K certs) lives in
docs/operator/scale.md as starter values pending Phase 8 load
tests — explicitly marked TBD.
SCALE-M3 (Med) — async-CA poll budget operator-configurable
Live state was partially-already-shipped: all 4 async-CA
connectors (digicert, entrust, globalsign, sectigo) already have
per-connector CERTCTL_<NAME>_POLL_MAX_WAIT_SECONDS (Audit fix #5
closed pre-Phase-6). What was missing: a global package-default
override. Shipped:
- internal/connector/issuer/asyncpoll/asyncpoll.go gains
SetDefaultMaxWait(d) + effectiveDefaultMaxWait var + the
currentDefaultMaxWait() priority resolver.
- cmd/server/main.go reads CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS
at boot and calls SetDefaultMaxWait.
- deploy/ENVIRONMENTS.md documents the new env var (G-3 guard
green).
Naming deviation from the prompt's CERTCTL_ASYNC_POLL_MAX_ATTEMPTS:
the live code tracks wall-clock time (MaxWait), not attempt count.
Matched the existing per-connector nomenclature (_POLL_MAX_WAIT_SECONDS)
so the priority chain reads naturally.
SCALE-M5 (Med) — JitteredTicker wrapper for all 15 scheduler loops
internal/scheduler/jitter.go ships NewJitteredTicker(interval,
jitterPct) + DefaultSchedulerJitter (±10%). All 15 sites in
internal/scheduler/scheduler.go migrated from bare time.NewTicker
to NewJitteredTicker(interval, DefaultSchedulerJitter). Base
intervals unchanged; only the per-tick envelope adds ±10%
randomized delay so multiple loops with the same nominal cadence
don't co-fire and spike CPU + DB at wall-clock boundaries.
internal/scheduler/jitter_test.go pins:
- Bounded envelope (each tick within ±jitterPct of interval)
- Mean drift < 30% of nominal (sign-bug detector)
- Stop() releases the goroutine + closes C
- Stop() idempotent (no panic on repeat)
- Zero-jitter behaves like time.NewTicker
- Negative and >=1 jitterPct values clamped defensively
CI guard scripts/ci-guards/no-bare-newticker-in-scheduler.sh blocks
any future bare time.NewTicker in scheduler.go.
SCALE-L1 (Low) — renewal-sweep semaphore behavior documented
docs/operator/scale.md "Scheduler tick budgets" section explains
the per-tick concurrency semaphore (CERTCTL_RENEWAL_CONCURRENCY=25
default), the ctx-cancellation drain on tick-budget overrun, and
operator tuning advice (raise concurrency + DB pool together).
No code change — the behavior is defensible as-is per the audit.
SCALE-L2 (Low) — ETag middleware for top-5 read endpoints
internal/api/middleware/etag.go computes SHA-256 ETag over the
buffered response body, respects If-None-Match, short-circuits
to 304 Not Modified on match. GET/HEAD only; non-2xx responses
pass through unchanged. 64 KiB buffer cap degrades gracefully on
oversized responses (no caching, body still flushes intact).
Wired around the top-5 read endpoints via etagged() helper in
internal/api/router/router.go:
GET /api/v1/certificates
GET /api/v1/agents
GET /api/v1/jobs
GET /api/v1/audit
GET /api/v1/discovered-certificates
internal/api/middleware/etag_test.go pins 11 behaviors including
304-on-repeat, 200-after-mutation-with-new-ETag, POST bypass,
4xx/5xx pass-through, oversized-response degradation, wildcard
match, HEAD-treated-like-GET, byte-equal pass-through.
Cross-cutting fixes:
- internal/config/config_test.go::TestLoad_DefaultValues updated
to assert the new 50 default (was 25).
- deploy/helm/certctl/values.yaml comment corrected — agent
pollInterval is hardcoded 30s, not env-configurable; the
Phase 4 comment mistakenly referenced CERTCTL_AGENT_POLL_INTERVAL
which G-3 caught as a phantom env var.
- asyncpoll.go reformatted by gofmt; functionally unchanged.
Verification (all pass):
grep -nE 'SetMaxOpenConns' internal/repository/postgres/db.go # finds 1 site
grep -nE 'CERTCTL_DATABASE_MAX_CONNS.*50' internal/config/config.go # config default is 50
grep -rnE 'CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS' internal/ deploy/ENVIRONMENTS.md # wired
grep -cE 'time\.NewTicker\(' internal/scheduler/scheduler.go # 0 (all migrated)
grep -cE 'JitteredTicker' internal/scheduler/scheduler.go # 15
ls internal/scheduler/jitter.go internal/api/middleware/etag.go # both exist
ls docs/operator/scale.md # exists
bash scripts/ci-guards/no-bare-newticker-in-scheduler.sh # clean
bash scripts/ci-guards/G-3-env-docs-drift.sh # clean
go test ./internal/scheduler/ ./internal/api/middleware/ \
./internal/connector/issuer/asyncpoll/ ./internal/config/ # 4/4 packages green
Closes: cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M1
cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M3
cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M5
cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L1
cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L2
This commit is contained in:
@@ -11,6 +11,23 @@ import (
|
||||
"github.com/certctl-io/certctl/internal/auth"
|
||||
)
|
||||
|
||||
// etagged wraps a list-endpoint handler with the SCALE-L2 ETag
|
||||
// middleware. Phase 6 SCALE-L2 closure (2026-05-14): the top-5
|
||||
// read-heavy list endpoints (/certificates, /jobs, /agents,
|
||||
// /audit, /discovered-certificates) get ETag + If-None-Match
|
||||
// short-circuit to avoid re-running their SELECT COUNT(*) +
|
||||
// row-marshaling pass on every dashboard poll. The helper chains
|
||||
// ETag around an already-rbac-gated handler so order is:
|
||||
//
|
||||
// request → ETag → rbacGate → handler
|
||||
//
|
||||
// (auth runs BEFORE the cache check; we never short-circuit a
|
||||
// 304 to an unauthenticated request. The middleware emits ETag
|
||||
// only on 2xx responses, so 401s/403s never get cached.)
|
||||
func etagged(h http.Handler) http.Handler {
|
||||
return middleware.ETag(h)
|
||||
}
|
||||
|
||||
// rbacGate wraps a handler with auth.RequirePermission(checker, perm,
|
||||
// nil) — i.e. a GLOBAL-SCOPE permission check. Used by RegisterHandlers
|
||||
// to gate every state-changing + read endpoint. When checker is nil the
|
||||
@@ -567,7 +584,7 @@ func (r *Router) RegisterHandlers(reg HandlerRegistry) {
|
||||
r.Register("POST /api/v1/est/certificates/bulk-revoke", rbacGate(reg.Checker, "cert.bulk_revoke", reg.BulkRevocation.BulkRevokeEST))
|
||||
r.Register("POST /api/v1/certificates/bulk-renew", rbacGate(reg.Checker, "cert.issue", reg.BulkRenewal.BulkRenew))
|
||||
r.Register("POST /api/v1/certificates/bulk-reassign", rbacGate(reg.Checker, "cert.edit", reg.BulkReassignment.BulkReassign))
|
||||
r.Register("GET /api/v1/certificates", rbacGate(reg.Checker, "cert.read", reg.Certificates.ListCertificates))
|
||||
r.Register("GET /api/v1/certificates", etagged(rbacGate(reg.Checker, "cert.read", reg.Certificates.ListCertificates)))
|
||||
r.Register("POST /api/v1/certificates", rbacGate(reg.Checker, "cert.issue", reg.Certificates.CreateCertificate))
|
||||
r.Register("GET /api/v1/certificates/{id}", rbacGate(reg.Checker, "cert.read", reg.Certificates.GetCertificate))
|
||||
r.Register("PUT /api/v1/certificates/{id}", rbacGate(reg.Checker, "cert.edit", reg.Certificates.UpdateCertificate))
|
||||
@@ -619,7 +636,7 @@ func (r *Router) RegisterHandlers(reg HandlerRegistry) {
|
||||
// * DELETE /api/v1/agents/{id} — RetireAgent. Replaces the pre-I-004
|
||||
// hard-delete; the underlying repo does a soft-retire with
|
||||
// optional cascade.
|
||||
r.Register("GET /api/v1/agents", rbacGate(reg.Checker, "agent.read", reg.Agents.ListAgents))
|
||||
r.Register("GET /api/v1/agents", etagged(rbacGate(reg.Checker, "agent.read", reg.Agents.ListAgents)))
|
||||
r.Register("POST /api/v1/agents", rbacGate(reg.Checker, "agent.edit", reg.Agents.RegisterAgent))
|
||||
r.Register("GET /api/v1/agents/retired", rbacGate(reg.Checker, "agent.read", reg.Agents.ListRetiredAgents))
|
||||
r.Register("GET /api/v1/agents/{id}", rbacGate(reg.Checker, "agent.read", reg.Agents.GetAgent))
|
||||
@@ -631,7 +648,7 @@ func (r *Router) RegisterHandlers(reg HandlerRegistry) {
|
||||
r.Register("POST /api/v1/agents/{id}/jobs/{job_id}/status", rbacGate(reg.Checker, "agent.job.complete", reg.Agents.AgentReportJobStatus))
|
||||
|
||||
// Jobs routes: /api/v1/jobs
|
||||
r.Register("GET /api/v1/jobs", rbacGate(reg.Checker, "job.read", reg.Jobs.ListJobs))
|
||||
r.Register("GET /api/v1/jobs", etagged(rbacGate(reg.Checker, "job.read", reg.Jobs.ListJobs)))
|
||||
r.Register("GET /api/v1/jobs/{id}", rbacGate(reg.Checker, "job.read", reg.Jobs.GetJob))
|
||||
r.Register("POST /api/v1/jobs/{id}/cancel", rbacGate(reg.Checker, "job.cancel", reg.Jobs.CancelJob))
|
||||
r.Register("POST /api/v1/jobs/{id}/approve", rbacGate(reg.Checker, "approval.approve", reg.Jobs.ApproveJob))
|
||||
@@ -695,7 +712,7 @@ func (r *Router) RegisterHandlers(reg HandlerRegistry) {
|
||||
r.Register("GET /api/v1/agent-groups/{id}/members", rbacGate(reg.Checker, "agent.read", reg.AgentGroups.ListAgentGroupMembers))
|
||||
|
||||
// Audit routes: /api/v1/audit
|
||||
r.Register("GET /api/v1/audit", rbacGate(reg.Checker, "audit.read", reg.Audit.ListAuditEvents))
|
||||
r.Register("GET /api/v1/audit", etagged(rbacGate(reg.Checker, "audit.read", reg.Audit.ListAuditEvents)))
|
||||
// Audit 2026-05-10 HIGH-11 closure — `audit.export` permission was
|
||||
// already seeded into r-admin + r-auditor (migration 000031), but
|
||||
// no endpoint enforced it pre-fix; r-auditor's claim was misleading
|
||||
@@ -765,7 +782,7 @@ func (r *Router) RegisterHandlers(reg HandlerRegistry) {
|
||||
|
||||
// Discovery routes: /api/v1/discovered-certificates, /api/v1/discovery-scans
|
||||
r.Register("POST /api/v1/agents/{id}/discoveries", rbacGate(reg.Checker, "discovery.run", reg.Discovery.SubmitDiscoveryReport))
|
||||
r.Register("GET /api/v1/discovered-certificates", rbacGate(reg.Checker, "discovery.read", reg.Discovery.ListDiscovered))
|
||||
r.Register("GET /api/v1/discovered-certificates", etagged(rbacGate(reg.Checker, "discovery.read", reg.Discovery.ListDiscovered)))
|
||||
r.Register("GET /api/v1/discovered-certificates/{id}", rbacGate(reg.Checker, "discovery.read", reg.Discovery.GetDiscovered))
|
||||
r.Register("POST /api/v1/discovered-certificates/{id}/claim", rbacGate(reg.Checker, "discovery.claim", reg.Discovery.ClaimDiscovered))
|
||||
r.Register("POST /api/v1/discovered-certificates/{id}/dismiss", rbacGate(reg.Checker, "discovery.claim", reg.Discovery.DismissDiscovered))
|
||||
|
||||
Reference in New Issue
Block a user