feat(ratelimit): Phase 13 Sprint 13.3 — wire backend selector + scheduler janitor + docs + helm (ARCH-M1 closure complete)

Phase 13 Sprint 13.3 — the completion half of the ARCH-M1
substantive close. Sprint 13.2 shipped the Postgres-backed
sliding-window limiter + multi-replica integration test; Sprint 13.3
wires the 6 call sites in cmd/server/main.go through the operator-
chosen backend selector, adds the rate_limit_buckets scheduler
janitor sweep, rewrites the observability doc, exposes the env-var
in the helm chart, and promotes the multi-replica integration test
to a required CI status check.

Signature ground-truth (sprint 13.2 + 13.3)
===========================================
Prompt-template signatures: `Allow(key string) error` and "5 call
sites." Actual repo: `Allow(key string, now time.Time) error` and 6
NewSlidingWindowLimiter call sites in cmd/server/main.go (the prompt
miscounted the second EST per-principal arm). Per CLAUDE.md "the repo
is truth," matched the live shape.

What changed
============

internal/config/server.go (+40 LOC):
  - Added `SlidingWindowBackend string` + `SlidingWindowJanitorInterval
    time.Duration` to RateLimitConfig with full operator-facing
    documentation of the two valid values (memory|postgres) +
    when-to-use-which decision tree.

internal/config/config.go (+27 LOC):
  - Load() reads CERTCTL_RATE_LIMIT_BACKEND (default "memory") +
    CERTCTL_RATE_LIMIT_JANITOR_INTERVAL (default 5m).
  - Validate() rejects anything other than ""/"memory"/"postgres"
    (empty = memory equivalence for test-built Configs that bypass
    Load()). Janitor interval must be ≥ 1 minute when set.
  - Failure modes return clear ::error:: with the env-var name + the
    valid values, so an operator typo ("postgress" → memory in a
    3-replica cluster) fails fast at startup.

internal/ratelimit/factory.go (NEW, 67 LOC):
  - NewLimiter(backend, db, maxN, window, mapCap) Limiter — single
    factory the 6 cmd/server/main.go call sites route through.
  - Drop-in signature: same maxN/window/mapCap as
    NewSlidingWindowLimiter (mapCap accepted + ignored for postgres
    — the rate_limit_buckets table grows until the janitor sweeps).
  - Defensive panic on unknown backend (config.Validate is SoT;
    this is belt-and-suspenders).

internal/ratelimit/postgres_gc.go (NEW, 73 LOC):
  - PostgresGC struct + NewPostgresGC + GarbageCollect.
  - Single-statement DELETE FROM rate_limit_buckets WHERE
    updated_at < NOW() - maxWindow. Idempotent.
  - maxWindow <= 0 is a no-op (operator opt-out).

internal/scheduler/scheduler.go (+90 LOC):
  - New RateLimitGarbageCollector interface (mirrors the
    ACMEGarbageCollector / SessionGarbageCollector contracts).
  - rateLimitGC field + rateLimitGCInterval + rateLimitGCRunning
    on Scheduler.
  - SetRateLimitGarbageCollector(gc) + SetRateLimitGCInterval(d)
    Setters following the existing acmeGC/sessionGC pattern.
  - rateLimitGCLoop() — JitteredTicker + atomic.Bool guard +
    per-tick context.WithTimeout(1m). Logs row count at Debug.
  - Loop counted in the Start() WaitGroup only when the GC is
    non-nil; cmd/server/main.go skips SetRateLimitGarbageCollector
    when backend=memory so the loop never launches for that case.

cmd/server/main.go (35 LOC diff):
  - All 6 ratelimit.NewSlidingWindowLimiter call sites now route
    through ratelimit.NewLimiter(cfg.RateLimit.SlidingWindowBackend,
    db, ...). Grep verification post-fix returns ZERO hits.
  - Six sites: breakglass loginLimiter (580), ocspLimiter (1003),
    exportLimiter (1068), EST failed-basic (1535), EST per-principal
    SCEP-mTLS arm (1591), EST per-principal SCEP arm (1613). The
    intune.NewPerDeviceRateLimiter site at line 1823 stays unmoved
    — its inner type-alias wrapper is the prompt's
    out-of-scope (cmd/server/*.go only).
  - Conditionally constructs PostgresGC + wires the scheduler janitor
    when backend=postgres; logs the wiring decision either way so
    operators see "rate-limit GC sweep enabled (postgres backend)"
    or "in-memory backend self-prunes" in the boot log.

internal/api/handler/{est,export,certificates,auth_breakglass}.go:
  - Replaced 5 *ratelimit.SlidingWindowLimiter field/Setter types
    with ratelimit.Limiter (the interface). Allow() satisfies the
    same call shape on both backends; the in-memory tests that
    construct *SlidingWindowLimiter still compile because the
    concrete type satisfies the interface (compile-time check in
    internal/ratelimit/limiter.go pins this).

docs/operator/observability.md (176 LOC diff):
  - Replaced the "per-process, in-memory, reset-on-restart, not
    shared across replicas" paragraph with the new
    configurable-backend section: operator decision tree,
    backend internals (memory vs postgres), janitor description,
    falsifiable closure proof (the Sprint 13.2 integration test
    name + invocation), helm chart wiring example.
  - Updated inventory to reflect the actual handler file paths +
    actual cap configurations (the prior doc said "60s window" for
    several limiters that actually use 60m / 24h windows).
  - Doc smoke confirmed: grep -c 'per-process, in-memory,
    reset-on-restart' docs/operator/observability.md = 0.

deploy/helm/certctl/values.yaml + templates/server-configmap.yaml +
templates/server-deployment.yaml:
  - Exposed server.rateLimiting.backend (default "memory") +
    server.rateLimiting.janitorInterval (default "5m") under the
    existing rateLimiting block.
  - ConfigMap renders both as rate-limit-backend +
    rate-limit-janitor-interval keys.
  - Deployment wires CERTCTL_RATE_LIMIT_BACKEND +
    CERTCTL_RATE_LIMIT_JANITOR_INTERVAL env vars from the configmap.
  - Helm render: `helm template deploy/helm/certctl --set
    server.rateLimiting.backend=postgres` shows the env-var on the
    server-deployment.yaml output.

.github/workflows/ci.yml (+12 LOC):
  - Added a new step in the Go Build & Test job that runs the
    Sprint 13.2 multi-replica integration test
    (TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas) with
    -tags=integration -race -timeout=300s. Fails the CI status check
    if the cross-replica row lock ever stops arbitrating across
    replicas — the ARCH-M1 closure regression gate.

Verification (all green locally; postgres integration via CI)
============================================================

  $ grep -nE 'NewSlidingWindowLimiter' cmd/server/*.go
    (zero hits — Sprint 13.3 receipt)

  $ go test -short -count=1 \
      ./internal/config/... ./internal/ratelimit/... \
      ./internal/scheduler/... ./internal/api/handler/... \
      ./cmd/server/...
    ok  internal/config       1.177s
    ok  internal/ratelimit    0.007s
    ok  internal/scheduler    9.165s
    ok  internal/api/handler  6.245s
    ok  cmd/server            0.390s

  $ staticcheck ./internal/ratelimit/... ./internal/scheduler/... \
      ./internal/config/... ./internal/api/handler/... ./cmd/server/...
    (clean)

  $ gofmt -l internal/ cmd/server/
    (clean)

  $ grep -c 'per-process, in-memory, reset-on-restart' \
      docs/operator/observability.md
    0   (doc smoke — the audit's verbatim phrasing is gone)

  $ bash scripts/ci-guards/G-3-env-docs-drift.sh
    G-3 env-docs-drift: clean.

  $ bash scripts/ci-guards/complete-path-config-coverage.sh
    OK — every CERTCTL_* env var (197) has at least one non-config-
    package consumer.

Selector contract verified — config.Validate() rejects any value
other than ""/memory/postgres at startup with a clear error message.

Sprint 13.4 next (ARCH-H1 OpenAPI authoring batch 1) is on a
different axis; ARCH-M1 closure is complete with this commit
modulo the Sprint 13.7 audit-HTML flip + zero-floor pin.

Closes: ARCH-M1 substantive remediation. The cross-replica rate-
limit-cap-enforcement gap that the audit recommended deferring to
v3 is closed; operators with server.replicas > 1 flip
CERTCTL_RATE_LIMIT_BACKEND=postgres and get exactly-cap enforcement
across the cluster (proved by the multi-replica integration test now
gating CI).
This commit is contained in:
shankar0123
2026-05-14 11:52:13 +00:00
parent c8347d742d
commit a41fc2d75c
15 changed files with 516 additions and 61 deletions
+2 -2
View File
@@ -78,7 +78,7 @@ type AuthBreakglassHandler struct {
// nil-safe: when unset, the handler skips the limiter check and
// relies on the service-layer Argon2id lockout. Production deploys
// MUST set this via SetLoginRateLimiter.
loginLimiter *ratelimit.SlidingWindowLimiter
loginLimiter ratelimit.Limiter
}
// NewAuthBreakglassHandler constructs the handler.
@@ -89,7 +89,7 @@ func NewAuthBreakglassHandler(svc BreakglassService, cookieAttrs SessionCookieAt
// SetLoginRateLimiter wires the per-source-IP rate limiter the Login
// handler enforces. Bundle 5 closure (S1) — see the AuthBreakglassHandler
// type docstring for the full rationale.
func (h *AuthBreakglassHandler) SetLoginRateLimiter(l *ratelimit.SlidingWindowLimiter) {
func (h *AuthBreakglassHandler) SetLoginRateLimiter(l ratelimit.Limiter) {
h.loginLimiter = l
}
+2 -2
View File
@@ -52,7 +52,7 @@ type CertificateService interface {
// CertificateHandler handles HTTP requests for certificate operations.
type CertificateHandler struct {
svc CertificateService
ocspLimiter *ratelimit.SlidingWindowLimiter // production hardening II Phase 3 — per-source-IP cap on OCSP
ocspLimiter ratelimit.Limiter // production hardening II Phase 3 — per-source-IP cap on OCSP
}
// NewCertificateHandler creates a new CertificateHandler with a service dependency.
@@ -65,7 +65,7 @@ func NewCertificateHandler(svc CertificateService) CertificateHandler {
// cmd/server/main.go): 1000 req/min/IP. Setting to nil disables the
// limit; the limiter's own NewSlidingWindowLimiter(maxN<=0, ...)
// also produces a no-op limiter, so the env-var-zero case is safe.
func (h *CertificateHandler) SetOCSPRateLimiter(l *ratelimit.SlidingWindowLimiter) {
func (h *CertificateHandler) SetOCSPRateLimiter(l ratelimit.Limiter) {
h.ocspLimiter = l
}
+4 -4
View File
@@ -100,13 +100,13 @@ type ESTHandler struct {
// EST RFC 7030 hardening Phase 3.3: per-handler source-IP rate
// limiter for FAILED HTTP Basic auth attempts. Keyed by sourceIP so
// a hostile network segment can't burn through the password.
failedBasicLimiter *ratelimit.SlidingWindowLimiter
failedBasicLimiter ratelimit.Limiter
// EST RFC 7030 hardening Phase 4.2: per-handler per-principal sliding-
// window rate limit. Keyed by (CSR-CN, sourceIP) so a stolen
// bootstrap cert AND a known device CN can't be used to flood the
// issuer. Disabled when nil; configured per-profile.
perPrincipalLimiter *ratelimit.SlidingWindowLimiter
perPrincipalLimiter ratelimit.Limiter
// labelForLog gives observability code a per-profile string to
// include in audit log lines / Prometheus labels. Defaults to
@@ -170,7 +170,7 @@ func (h *ESTHandler) SetEnrollmentPassword(pw string) { h.basicPassword = pw }
// rate limiter. Phase 3.3. Disabled when nil — but Validate() at
// startup refuses an enabled basic-auth profile without a configured
// limiter, so a real deploy always wires one.
func (h *ESTHandler) SetSourceIPRateLimiter(l *ratelimit.SlidingWindowLimiter) {
func (h *ESTHandler) SetSourceIPRateLimiter(l ratelimit.Limiter) {
h.failedBasicLimiter = l
}
@@ -179,7 +179,7 @@ func (h *ESTHandler) SetSourceIPRateLimiter(l *ratelimit.SlidingWindowLimiter) {
// every successful enrollment, NOT just failures — the goal is to
// bound enrollment-flooding from a compromised credential, not just
// failed-auth brute force.
func (h *ESTHandler) SetPerPrincipalRateLimiter(l *ratelimit.SlidingWindowLimiter) {
func (h *ESTHandler) SetPerPrincipalRateLimiter(l ratelimit.Limiter) {
h.perPrincipalLimiter = l
}
+2 -2
View File
@@ -28,7 +28,7 @@ type ExportService interface {
// ExportHandler handles HTTP requests for certificate export operations.
type ExportHandler struct {
svc ExportService
exportLimiter *ratelimit.SlidingWindowLimiter // production hardening II Phase 3
exportLimiter ratelimit.Limiter // production hardening II Phase 3
}
// NewExportHandler creates a new ExportHandler with a service dependency.
@@ -40,7 +40,7 @@ func NewExportHandler(svc ExportService) ExportHandler {
// Production hardening II Phase 3. Default cap (when set in
// cmd/server/main.go): 50 exports/hr/operator. Setting to nil
// disables the limit.
func (h *ExportHandler) SetExportRateLimiter(l *ratelimit.SlidingWindowLimiter) {
func (h *ExportHandler) SetExportRateLimiter(l ratelimit.Limiter) {
h.exportLimiter = l
}
+37 -5
View File
@@ -441,11 +441,13 @@ func Load() (*Config, error) {
},
},
RateLimit: RateLimitConfig{
Enabled: getEnvBool("CERTCTL_RATE_LIMIT_ENABLED", true),
RPS: getEnvFloat("CERTCTL_RATE_LIMIT_RPS", 50),
BurstSize: getEnvInt("CERTCTL_RATE_LIMIT_BURST", 100),
PerUserRPS: getEnvFloat("CERTCTL_RATE_LIMIT_PER_USER_RPS", 0),
PerUserBurstSize: getEnvInt("CERTCTL_RATE_LIMIT_PER_USER_BURST", 0),
Enabled: getEnvBool("CERTCTL_RATE_LIMIT_ENABLED", true),
RPS: getEnvFloat("CERTCTL_RATE_LIMIT_RPS", 50),
BurstSize: getEnvInt("CERTCTL_RATE_LIMIT_BURST", 100),
PerUserRPS: getEnvFloat("CERTCTL_RATE_LIMIT_PER_USER_RPS", 0),
PerUserBurstSize: getEnvInt("CERTCTL_RATE_LIMIT_PER_USER_BURST", 0),
SlidingWindowBackend: getEnv("CERTCTL_RATE_LIMIT_BACKEND", "memory"),
SlidingWindowJanitorInterval: getEnvDuration("CERTCTL_RATE_LIMIT_JANITOR_INTERVAL", 5*time.Minute),
},
CORS: CORSConfig{
AllowedOrigins: getEnvList("CERTCTL_CORS_ORIGINS", nil),
@@ -764,6 +766,36 @@ func (c *Config) Validate() error {
)
}
// Phase 13 Sprint 13.3 closure (ARCH-M1): validate
// CERTCTL_RATE_LIMIT_BACKEND is one of the two supported values.
// Fail-closed on any other input so a typo doesn't silently fall
// back to the wrong backend (the operator picked "postgress" and
// got memory rate-limits in a 3-replica cluster).
switch c.RateLimit.SlidingWindowBackend {
case "", "memory", "postgres":
// "" is treated as "memory" — test-built Configs (which
// construct the struct literal directly without going
// through Load()) don't get the default; Load() always
// fills "memory". Either path lands the runtime on the
// in-memory backend.
default:
return fmt.Errorf(
"invalid CERTCTL_RATE_LIMIT_BACKEND=%q — refuse to start: must be \"memory\" (default, per-process limits; for single-replica deploys) or \"postgres\" (cross-replica-consistent via the rate_limit_buckets table; required for HA deploys). See docs/operator/observability.md.",
c.RateLimit.SlidingWindowBackend,
)
}
// Janitor interval lower bound — 1 minute. Below this the sweep
// cost outweighs the row-cleanup benefit; above this still
// matches the operator's bound (5 minutes default; can be raised
// indefinitely).
if c.RateLimit.SlidingWindowJanitorInterval > 0 &&
c.RateLimit.SlidingWindowJanitorInterval < time.Minute {
return fmt.Errorf(
"invalid CERTCTL_RATE_LIMIT_JANITOR_INTERVAL=%v — refuse to start: must be ≥ 1 minute (default 5m).",
c.RateLimit.SlidingWindowJanitorInterval,
)
}
// Validate database configuration
if c.Database.URL == "" {
return fmt.Errorf("database URL is required")
+40
View File
@@ -321,6 +321,46 @@ type RateLimitConfig struct {
// zero, BurstSize is used. Default: 0 (use BurstSize).
// Setting: CERTCTL_RATE_LIMIT_PER_USER_BURST environment variable.
PerUserBurstSize int
// SlidingWindowBackend selects which backend implements the
// per-key sliding-window-log limiters wired in cmd/server/main.go
// (break-glass login, OCSP per-IP, cert-export per-actor, EST
// per-principal, EST failed-basic source-IP). Distinct from the
// token-bucket fields above — those are middleware RPS limits
// applied across every request via the http handler chain; this
// field controls the sliding-window-log primitive used by
// authenticated-but-shared-credential code paths.
//
// Valid values:
// "memory" — per-process, sync.Mutex-guarded map (historical
// default; perfect for single-replica deploys).
// "postgres" — cross-replica-consistent via the
// rate_limit_buckets table (migration 000046).
// SELECT FOR UPDATE arbitrates per-key access
// across the cluster. Adds ~2 DB round-trips per
// Allow call; acceptable on the gated hot path.
//
// Default: "memory". HA deploys with server.replicas > 1 should
// flip to "postgres" so a 2-replica deployment doesn't effectively
// double the per-key cap.
//
// Phase 13 Sprint 13.2/13.3 closure (architecture diligence audit
// ARCH-M1). See docs/operator/observability.md.
//
// Setting: CERTCTL_RATE_LIMIT_BACKEND environment variable.
SlidingWindowBackend string
// SlidingWindowJanitorInterval is how often the scheduler sweeps
// stale rows from rate_limit_buckets. A row is stale when its
// updated_at is older than the longest configured window any
// caller uses (currently 24h for the EST per-principal limiter).
// Default: 5 minutes. Minimum: 1 minute. No-op when
// SlidingWindowBackend = "memory" (the in-memory backend's
// prune-on-Allow path keeps buckets short-lived without a
// separate sweep).
//
// Setting: CERTCTL_RATE_LIMIT_JANITOR_INTERVAL environment variable.
SlidingWindowJanitorInterval time.Duration
}
// CORSConfig contains CORS configuration.
+65
View File
@@ -0,0 +1,65 @@
// Copyright 2026 certctl LLC. All rights reserved.
// SPDX-License-Identifier: BUSL-1.1
package ratelimit
import (
"database/sql"
"fmt"
"time"
)
// Phase 13 Sprint 13.3 (2026-05-14, architecture diligence audit
// ARCH-M1): the backend-selector factory. Wires every
// `ratelimit.NewSlidingWindowLimiter(...)` call site in
// cmd/server/main.go through here so the operator-chosen backend
// (CERTCTL_RATE_LIMIT_BACKEND={memory,postgres}) gates the limiter
// type without each call site replicating the switch.
//
// Caller-visible behavior contract: NewLimiter(backend="memory", ...)
// returns a *SlidingWindowLimiter identical to a direct
// NewSlidingWindowLimiter call. NewLimiter(backend="postgres", ...)
// returns a *PostgresSlidingWindowLimiter with the same Allow(key, now)
// signature + the same ErrRateLimited sentinel + the same maxN<=0
// disabled semantics. Sprint 13.3's "no signature change" rule is
// what makes the swap drop-in.
//
// The mapCap argument is the in-memory backend's per-instance
// key-cap (LRU-evicted under pressure). Postgres backend has no
// equivalent — the table grows until the scheduler janitor sweeps
// stale rows; mapCap is accepted + ignored for that backend so the
// factory signature stays drop-in identical to NewSlidingWindowLimiter.
// NewLimiter returns a Limiter backed by either the in-memory
// SlidingWindowLimiter (backend="memory") or the
// PostgresSlidingWindowLimiter (backend="postgres").
//
// `backend` is validated by config.Validate() at startup; any other
// value here panics — config validation is the SoT, this is just
// defensive in case the call site somehow bypasses startup
// validation.
//
// `db` is required when backend="postgres" and ignored when
// backend="memory". The factory does not nil-check db for the
// memory branch because requiring a meaningful db handle for the
// memory path would couple every limiter call site to the database
// pool unnecessarily.
//
// `maxN <= 0` disables the limiter (both backends honor the
// opt-out — all Allow calls return nil).
func NewLimiter(backend string, db *sql.DB, maxN int, window time.Duration, mapCap int) Limiter {
switch backend {
case "memory":
return NewSlidingWindowLimiter(maxN, window, mapCap)
case "postgres":
if db == nil {
panic("ratelimit.NewLimiter: backend=postgres requires a non-nil *sql.DB (config.Validate should have caught this earlier)")
}
return NewPostgresSlidingWindowLimiter(db, maxN, window)
default:
// Defensive — config.Validate() rejects anything else at
// startup. Reaching this branch implies a coding error in a
// future call site that bypasses validation.
panic(fmt.Sprintf("ratelimit.NewLimiter: unknown backend %q (must be memory or postgres)", backend))
}
}
+71
View File
@@ -0,0 +1,71 @@
// Copyright 2026 certctl LLC. All rights reserved.
// SPDX-License-Identifier: BUSL-1.1
package ratelimit
import (
"context"
"database/sql"
"fmt"
"time"
)
// Phase 13 Sprint 13.3 closure (2026-05-14, architecture diligence audit
// ARCH-M1): the scheduler-invoked janitor for the postgres-backed
// rate-limit bucket table. Sweeps rows whose updated_at is older than
// the longest configured window any caller uses — these rows can
// never be at-cap (every timestamp inside has aged past the window),
// so dropping them entirely is safe.
//
// The in-memory backend's prune-on-Allow path keeps buckets short-
// lived without a separate sweep; this file is postgres-only.
// PostgresGC drives the rate_limit_buckets sweep. Constructed from the
// same *sql.DB the limiters use; the scheduler holds it as a value
// satisfying the ratelimit.GarbageCollector interface (mirrors the
// shape of acme.GarbageCollector + sessions.GarbageCollector).
type PostgresGC struct {
db *sql.DB
maxWindow time.Duration
}
// NewPostgresGC returns a janitor that sweeps rows whose updated_at
// is older than `maxWindow` ago. Pass the longest window any caller
// in the deployment configures (the EST per-principal limiter uses
// 24h today; bump if a new caller introduces a longer window).
//
// maxWindow <= 0 disables the sweep — GarbageCollect becomes a
// no-op. Operator opt-out for sketchpad / single-replica deploys
// that still want the postgres backend (rare; the memory backend is
// the better fit).
func NewPostgresGC(db *sql.DB, maxWindow time.Duration) *PostgresGC {
return &PostgresGC{db: db, maxWindow: maxWindow}
}
// GarbageCollect deletes every rate_limit_buckets row whose
// updated_at is older than now-maxWindow. Returns the number of
// rows deleted + any error from the DELETE.
//
// Single statement, single round-trip — operates on the
// rate_limit_buckets_updated_at_idx index introduced in migration
// 000046. Idempotent: repeated calls find 0 rows.
func (g *PostgresGC) GarbageCollect(ctx context.Context) (int64, error) {
if g.maxWindow <= 0 {
return 0, nil
}
cutoff := time.Now().Add(-g.maxWindow)
res, err := g.db.ExecContext(ctx, `
DELETE FROM rate_limit_buckets
WHERE updated_at < $1
`, cutoff)
if err != nil {
return 0, fmt.Errorf("ratelimit-gc: delete stale buckets: %w", err)
}
n, err := res.RowsAffected()
if err != nil {
// Driver doesn't expose RowsAffected; rare. Don't fail the
// sweep — the delete already ran.
return 0, nil
}
return n, nil
}
+90
View File
@@ -103,6 +103,21 @@ type BCLReplayGarbageCollector interface {
SweepExpired(ctx context.Context, now time.Time) (int, error)
}
// RateLimitGarbageCollector sweeps stale rows from the
// rate_limit_buckets table introduced in migration 000046. Phase 13
// Sprint 13.3 (ARCH-M1 closure completion) — wired only when
// CERTCTL_RATE_LIMIT_BACKEND=postgres. Concrete impl is
// *ratelimit.PostgresGC. Mirrors the ACMEGarbageCollector +
// SessionGarbageCollector contracts so the scheduler reuses the same
// atomic.Bool + WithTimeout + ticker pattern as the existing GC loops.
//
// Returns the row count to surface via observability logs (matches
// SessionGarbageCollector's shape — the operator wants to see
// "how many buckets did the sweep delete" in steady-state monitoring).
type RateLimitGarbageCollector interface {
GarbageCollect(ctx context.Context) (int64, error)
}
// JobReaperService defines the interface for job timeout reaping used by the scheduler.
type JobReaperService interface {
ReapTimedOutJobs(ctx context.Context, csrTTL, approvalTTL time.Duration) error
@@ -130,6 +145,7 @@ type Scheduler struct {
acmeGC ACMEGarbageCollector
sessionGC SessionGarbageCollector
bclReplayGC BCLReplayGarbageCollector
rateLimitGC RateLimitGarbageCollector
jobReaper JobReaperService
logger *slog.Logger
@@ -149,6 +165,7 @@ type Scheduler struct {
jobTimeoutInterval time.Duration
acmeGCInterval time.Duration
sessionGCInterval time.Duration
rateLimitGCInterval time.Duration
// agentOfflineJobTTL: per-tick threshold for reaping Running jobs whose
// owning agent has been silent. Bundle C / Audit M-016. Defaults below.
agentOfflineJobTTL time.Duration
@@ -171,6 +188,7 @@ type Scheduler struct {
jobTimeoutRunning atomic.Bool
acmeGCRunning atomic.Bool
sessionGCRunning atomic.Bool
rateLimitGCRunning atomic.Bool
// Graceful shutdown: wait for in-flight work to complete
wg sync.WaitGroup
@@ -209,6 +227,7 @@ func NewScheduler(
jobTimeoutInterval: 10 * time.Minute,
acmeGCInterval: 1 * time.Minute,
sessionGCInterval: 1 * time.Hour,
rateLimitGCInterval: 5 * time.Minute,
// 5 minutes is 5×agentHealthCheckInterval default of 1m; an agent
// must miss multiple heartbeats before its in-flight jobs are reaped.
agentOfflineJobTTL: 5 * time.Minute,
@@ -365,6 +384,29 @@ func (s *Scheduler) SetSessionGCInterval(d time.Duration) {
s.sessionGCInterval = d
}
// SetRateLimitGarbageCollector wires the Phase 13 Sprint 13.3 rate-
// limit bucket GC. Optional; nil disables the loop (which is the
// correct behavior when CERTCTL_RATE_LIMIT_BACKEND=memory — the
// in-memory backend's prune-on-Allow path keeps buckets short-lived
// without a separate sweep).
//
// Concrete impl is *ratelimit.PostgresGC, constructed in
// cmd/server/main.go only when the postgres backend is selected.
func (s *Scheduler) SetRateLimitGarbageCollector(gc RateLimitGarbageCollector) {
s.rateLimitGC = gc
}
// SetRateLimitGCInterval configures the interval at which the rate-
// limit GC sweep runs. Default 5m. Wire:
// CERTCTL_RATE_LIMIT_JANITOR_INTERVAL. Zero or negative values are
// ignored.
func (s *Scheduler) SetRateLimitGCInterval(d time.Duration) {
if d <= 0 {
return
}
s.rateLimitGCInterval = d
}
// SetAgentOfflineJobTTL sets the threshold past which a Running job whose
// owning agent has gone silent is reaped to Failed. Bundle C / Audit M-016.
// Zero or negative values are ignored (the default of 5 minutes is kept).
@@ -426,6 +468,9 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
if s.sessionGC != nil {
loopCount++
}
if s.rateLimitGC != nil {
loopCount++
}
s.wg.Add(loopCount)
go func() { defer s.wg.Done(); s.renewalCheckLoop(ctx) }()
@@ -457,6 +502,9 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
if s.sessionGC != nil {
go func() { defer s.wg.Done(); s.sessionGCLoop(ctx) }()
}
if s.rateLimitGC != nil {
go func() { defer s.wg.Done(); s.rateLimitGCLoop(ctx) }()
}
// Signal that all loops are launched
close(startedChan)
@@ -1247,3 +1295,45 @@ func (s *Scheduler) sessionGCLoop(ctx context.Context) {
}
}
}
// rateLimitGCLoop runs every rateLimitGCInterval and invokes
// RateLimitGarbageCollector.GarbageCollect, which sweeps stale rows
// from the rate_limit_buckets table introduced in Phase 13 Sprint
// 13.2's migration 000046.
//
// Wired only when CERTCTL_RATE_LIMIT_BACKEND=postgres (the in-memory
// backend's prune-on-Allow path keeps buckets short-lived without a
// separate sweep — cmd/server/main.go skips SetRateLimitGarbageCollector
// for that case so this loop never launches).
//
// Phase 13 Sprint 13.3 closure. The atomic.Bool guard + per-tick
// context.WithTimeout match every other GC loop's pattern.
func (s *Scheduler) rateLimitGCLoop(ctx context.Context) {
ticker := NewJitteredTicker(s.rateLimitGCInterval, DefaultSchedulerJitter)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return
case <-ticker.C:
if !s.rateLimitGCRunning.CompareAndSwap(false, true) {
s.logger.Warn("rate-limit GC sweep still running, skipping tick")
continue
}
s.wg.Add(1)
go func() {
defer s.wg.Done()
defer s.rateLimitGCRunning.Store(false)
// 1-minute timeout matches acme + session GC loops.
opCtx, cancel := context.WithTimeout(ctx, time.Minute)
defer cancel()
if n, err := s.rateLimitGC.GarbageCollect(opCtx); err != nil {
s.logger.Warn("rate-limit gc sweep failed (next tick will retry)", "error", err)
} else if n > 0 {
s.logger.Debug("rate-limit gc swept stale buckets", "rows", n)
}
}()
}
}
}