mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 13:51:36 +00:00
feat(ratelimit): Phase 13 Sprint 13.3 — wire backend selector + scheduler janitor + docs + helm (ARCH-M1 closure complete)
Phase 13 Sprint 13.3 — the completion half of the ARCH-M1
substantive close. Sprint 13.2 shipped the Postgres-backed
sliding-window limiter + multi-replica integration test; Sprint 13.3
wires the 6 call sites in cmd/server/main.go through the operator-
chosen backend selector, adds the rate_limit_buckets scheduler
janitor sweep, rewrites the observability doc, exposes the env-var
in the helm chart, and promotes the multi-replica integration test
to a required CI status check.
Signature ground-truth (sprint 13.2 + 13.3)
===========================================
The prompt template assumed `Allow(key string) error` and "5 call
sites." The repo actually has `Allow(key string, now time.Time) error`
and 6 NewSlidingWindowLimiter call sites in cmd/server/main.go (the
prompt miscounted the second EST per-principal arm). Per CLAUDE.md
("the repo is truth"), this commit matches the live shape.
What changed
============
internal/config/server.go (+40 LOC):
- Added `SlidingWindowBackend string` + `SlidingWindowJanitorInterval
time.Duration` to RateLimitConfig with full operator-facing
documentation of the two valid values (memory|postgres) +
when-to-use-which decision tree.
internal/config/config.go (+27 LOC):
- Load() reads CERTCTL_RATE_LIMIT_BACKEND (default "memory") +
CERTCTL_RATE_LIMIT_JANITOR_INTERVAL (default 5m).
- Validate() rejects anything other than ""/"memory"/"postgres"
(empty = memory equivalence for test-built Configs that bypass
Load()). Janitor interval must be ≥ 1 minute when set.
- Validation failures return a clear ::error:: naming the env var and
the valid values, so an operator typo (e.g. "postgress" quietly
meaning memory in a 3-replica cluster) fails fast at startup.
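The validation rules above can be sketched as follows. This is a minimal illustration, not the repo's Validate(): the validateRateLimit helper and the exact error wording are assumptions, and the struct carries only the two fields named in this message.

```go
package main

import (
	"fmt"
	"time"
)

// RateLimitConfig carries only the two fields named in the commit message.
// The real struct in internal/config has more fields.
type RateLimitConfig struct {
	SlidingWindowBackend         string
	SlidingWindowJanitorInterval time.Duration
}

// validateRateLimit is a hypothetical helper that applies the two rules
// described above: a closed set of backend names, and a 1-minute floor on
// the janitor interval when one is set.
func validateRateLimit(c RateLimitConfig) error {
	switch c.SlidingWindowBackend {
	case "", "memory", "postgres":
		// "" is treated as memory for configs built without Load().
	default:
		return fmt.Errorf("CERTCTL_RATE_LIMIT_BACKEND=%q is invalid (valid values: memory, postgres)",
			c.SlidingWindowBackend)
	}
	if c.SlidingWindowJanitorInterval != 0 && c.SlidingWindowJanitorInterval < time.Minute {
		return fmt.Errorf("CERTCTL_RATE_LIMIT_JANITOR_INTERVAL=%s is too short (minimum 1m)",
			c.SlidingWindowJanitorInterval)
	}
	return nil
}

func main() {
	cases := []RateLimitConfig{
		{SlidingWindowBackend: "postgres", SlidingWindowJanitorInterval: 5 * time.Minute},
		{SlidingWindowBackend: "postgress"}, // typo: rejected at startup
		{SlidingWindowBackend: "memory", SlidingWindowJanitorInterval: 30 * time.Second},
	}
	for _, c := range cases {
		fmt.Printf("%+v -> %v\n", c, validateRateLimit(c))
	}
}
```

Rejecting unknown names outright, rather than defaulting them to memory, is what turns the typo into a startup failure.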
internal/ratelimit/factory.go (NEW, 67 LOC):
- NewLimiter(backend, db, maxN, window, mapCap) Limiter — single
factory the 6 cmd/server/main.go call sites route through.
- Drop-in signature: same maxN/window/mapCap as
NewSlidingWindowLimiter (mapCap accepted + ignored for postgres
— the rate_limit_buckets table grows until the janitor sweeps).
- Defensive panic on unknown backend (config.Validate is SoT;
this is belt-and-suspenders).
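For readers unfamiliar with the contract the factory preserves, here is a rough sketch of a sliding-window limiter with the Allow(key, now) shape. It is an illustrative stand-in, not the repo's SlidingWindowLimiter: memoryLimiter and demo are hypothetical names, and the sketch omits mapCap and locking.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrRateLimited is named after the sentinel mentioned in the factory comments.
var ErrRateLimited = errors.New("rate limited")

// Limiter mirrors the Allow(key, now) shape described in this commit.
type Limiter interface {
	Allow(key string, now time.Time) error
}

// memoryLimiter is a hypothetical, single-goroutine stand-in.
type memoryLimiter struct {
	maxN   int
	window time.Duration
	hits   map[string][]time.Time
}

// Compile-time check that the concrete type satisfies the interface.
var _ Limiter = (*memoryLimiter)(nil)

func newMemoryLimiter(maxN int, window time.Duration) *memoryLimiter {
	return &memoryLimiter{maxN: maxN, window: window, hits: make(map[string][]time.Time)}
}

// Allow prunes timestamps that have aged out of the window, then admits the
// call only if fewer than maxN timestamps remain.
func (l *memoryLimiter) Allow(key string, now time.Time) error {
	if l.maxN <= 0 {
		return nil // maxN <= 0 disables the limiter
	}
	cutoff := now.Add(-l.window)
	kept := l.hits[key][:0]
	for _, t := range l.hits[key] {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	if len(kept) >= l.maxN {
		l.hits[key] = kept
		return ErrRateLimited
	}
	l.hits[key] = append(kept, now)
	return nil
}

// demo sends four calls for one key through a 2-per-minute limiter and
// reports which were allowed.
func demo() []bool {
	var lim Limiter = newMemoryLimiter(2, time.Minute)
	start := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	offsets := []time.Duration{0, 10 * time.Second, 20 * time.Second, 61 * time.Second}
	allowed := make([]bool, 0, len(offsets))
	for _, off := range offsets {
		err := lim.Allow("client-a", start.Add(off))
		allowed = append(allowed, !errors.Is(err, ErrRateLimited))
	}
	return allowed
}

func main() {
	fmt.Println(demo()) // [true true false true]
}
```

The third call is rejected because two timestamps are still inside the window; by the fourth, the first timestamp has aged out. Passing now explicitly is what keeps such a limiter testable without sleeping.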
internal/ratelimit/postgres_gc.go (NEW, 73 LOC):
- PostgresGC struct + NewPostgresGC + GarbageCollect.
- Single-statement DELETE FROM rate_limit_buckets WHERE
updated_at < $1, with the cutoff computed in Go as
time.Now() - maxWindow. Idempotent.
- maxWindow <= 0 is a no-op (operator opt-out).
internal/scheduler/scheduler.go (+90 LOC):
- New RateLimitGarbageCollector interface (mirrors the
ACMEGarbageCollector / SessionGarbageCollector contracts).
- rateLimitGC field + rateLimitGCInterval + rateLimitGCRunning
on Scheduler.
- SetRateLimitGarbageCollector(gc) + SetRateLimitGCInterval(d)
setters following the existing acmeGC/sessionGC pattern.
- rateLimitGCLoop() — JitteredTicker + atomic.Bool guard +
per-tick context.WithTimeout(1m). Logs row count at Debug.
- Loop counted in the Start() WaitGroup only when the GC is
non-nil; cmd/server/main.go skips SetRateLimitGarbageCollector
when backend=memory so the loop never launches for that case.
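The loop shape described above (tick, skip if a sweep is still running, bound each sweep with a timeout) might look roughly like this. It is a sketch under assumptions: sweeper, sweepOnce, and fakeGC are hypothetical names, and a plain time.Ticker stands in for the repo's JitteredTicker.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// GarbageCollector matches the GarbageCollect signature in postgres_gc.go.
type GarbageCollector interface {
	GarbageCollect(ctx context.Context) (int64, error)
}

// sweeper is a hypothetical stand-in for the scheduler's rate-limit GC state.
type sweeper struct {
	gc      GarbageCollector
	running atomic.Bool
}

// sweepOnce runs one guarded sweep. The bool result is false when an earlier
// sweep is still in flight and this tick was skipped.
func (s *sweeper) sweepOnce(parent context.Context, timeout time.Duration) (int64, bool, error) {
	if !s.running.CompareAndSwap(false, true) {
		return 0, false, nil
	}
	defer s.running.Store(false)
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	n, err := s.gc.GarbageCollect(ctx)
	return n, true, err
}

// loop ticks until ctx is cancelled, bounding each sweep to one minute.
func (s *sweeper) loop(ctx context.Context, interval time.Duration, wg *sync.WaitGroup) {
	defer wg.Done()
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			if n, ran, err := s.sweepOnce(ctx, time.Minute); err != nil {
				fmt.Println("rate-limit GC failed:", err)
			} else if ran {
				fmt.Println("rate-limit GC deleted rows:", n)
			}
		}
	}
}

// fakeGC is a test double that pretends three rows were deleted.
type fakeGC struct{}

func (fakeGC) GarbageCollect(context.Context) (int64, error) { return 3, nil }

// demo runs one normal sweep, then one while a sweep is marked in flight.
func demo() (rows int64, ran, skipped bool) {
	s := &sweeper{gc: fakeGC{}}
	rows, ran, _ = s.sweepOnce(context.Background(), time.Second)
	s.running.Store(true) // simulate a sweep that is still in flight
	_, ranAgain, _ := s.sweepOnce(context.Background(), time.Second)
	return rows, ran, !ranAgain
}

func main() {
	fmt.Println(demo())
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	var wg sync.WaitGroup
	wg.Add(1)
	go (&sweeper{gc: fakeGC{}}).loop(ctx, 10*time.Millisecond, &wg)
	wg.Wait()
}
```

The compare-and-swap guard means a slow DELETE causes later ticks to be skipped rather than stacked up behind it.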
cmd/server/main.go (35 LOC diff):
- All 6 ratelimit.NewSlidingWindowLimiter call sites now route
through ratelimit.NewLimiter(cfg.RateLimit.SlidingWindowBackend,
db, ...). Grep verification post-fix returns ZERO hits.
- Six sites: breakglass loginLimiter (580), ocspLimiter (1003),
exportLimiter (1068), EST failed-basic (1535), EST per-principal
SCEP-mTLS arm (1591), EST per-principal SCEP arm (1613). The
intune.NewPerDeviceRateLimiter site at line 1823 is left untouched:
its inner type-alias wrapper is outside the prompt's scope
(cmd/server/*.go only).
- Conditionally constructs PostgresGC + wires the scheduler janitor
when backend=postgres; logs the wiring decision either way so
operators see "rate-limit GC sweep enabled (postgres backend)"
or "in-memory backend self-prunes" in the boot log.
internal/api/handler/{est,export,certificates,auth_breakglass}.go:
- Replaced 5 *ratelimit.SlidingWindowLimiter field/Setter types
with ratelimit.Limiter (the interface). Allow() satisfies the
same call shape on both backends; the in-memory tests that
construct *SlidingWindowLimiter still compile because the
concrete type satisfies the interface (compile-time check in
internal/ratelimit/limiter.go pins this).
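The field-type swap can be illustrated with a small sketch. Everything here is hypothetical except the Allow(key, now) shape and the ErrRateLimited name: exportHandler, statusFor, the test doubles, the 429 mapping, and the "nil limiter means unlimited" behavior are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

var ErrRateLimited = errors.New("rate limited")

// Limiter is the interface the handlers now depend on.
type Limiter interface {
	Allow(key string, now time.Time) error
}

// allowAll and denyAll are hypothetical test doubles.
type allowAll struct{}

func (allowAll) Allow(string, time.Time) error { return nil }

type denyAll struct{}

func (denyAll) Allow(string, time.Time) error { return ErrRateLimited }

// Compile-time checks: the build breaks if a type stops satisfying Limiter.
var (
	_ Limiter = allowAll{}
	_ Limiter = denyAll{}
)

// exportHandler is a hypothetical handler holding the interface rather than
// a concrete limiter type, so either backend can be injected.
type exportHandler struct {
	limiter Limiter
}

func (h *exportHandler) SetLimiter(l Limiter) { h.limiter = l }

// statusFor returns the HTTP status the handler would answer with.
// Assumption: an unset limiter means no limiting.
func (h *exportHandler) statusFor(principal string, now time.Time) int {
	if h.limiter == nil {
		return http.StatusOK
	}
	if err := h.limiter.Allow(principal, now); errors.Is(err, ErrRateLimited) {
		return http.StatusTooManyRequests
	}
	return http.StatusOK
}

// demo exercises the handler with no limiter, an allowing one, and a denying one.
func demo() []int {
	now := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	h := &exportHandler{}
	codes := []int{h.statusFor("alice", now)}
	h.SetLimiter(allowAll{})
	codes = append(codes, h.statusFor("alice", now))
	h.SetLimiter(denyAll{})
	codes = append(codes, h.statusFor("alice", now))
	return codes
}

func main() {
	fmt.Println(demo()) // [200 200 429]
}
```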
docs/operator/observability.md (176 LOC diff):
- Replaced the "per-process, in-memory, reset-on-restart, not
shared across replicas" paragraph with the new
configurable-backend section: operator decision tree,
backend internals (memory vs postgres), janitor description,
falsifiable closure proof (the Sprint 13.2 integration test
name + invocation), helm chart wiring example.
- Updated inventory to reflect the actual handler file paths +
actual cap configurations (the prior doc said "60s window" for
several limiters that actually use 60m / 24h windows).
- Doc smoke confirmed: grep -c 'per-process, in-memory,
reset-on-restart' docs/operator/observability.md = 0.
deploy/helm/certctl/values.yaml + templates/server-configmap.yaml +
templates/server-deployment.yaml:
- Exposed server.rateLimiting.backend (default "memory") +
server.rateLimiting.janitorInterval (default "5m") under the
existing rateLimiting block.
- ConfigMap renders both as rate-limit-backend +
rate-limit-janitor-interval keys.
- Deployment wires CERTCTL_RATE_LIMIT_BACKEND +
CERTCTL_RATE_LIMIT_JANITOR_INTERVAL env vars from the configmap.
- Helm render: `helm template deploy/helm/certctl --set
server.rateLimiting.backend=postgres` shows the env var in the
rendered server-deployment.yaml output.
.github/workflows/ci.yml (+12 LOC):
- Added a new step in the Go Build & Test job that runs the
Sprint 13.2 multi-replica integration test
(TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas) with
-tags=integration -race -timeout=300s. The CI status check fails
if the row lock ever stops arbitrating across replicas; this is
the ARCH-M1 closure regression gate.
Verification (all green locally; postgres integration via CI)
============================================================
$ grep -nE 'NewSlidingWindowLimiter' cmd/server/*.go
(zero hits — Sprint 13.3 receipt)
$ go test -short -count=1 \
./internal/config/... ./internal/ratelimit/... \
./internal/scheduler/... ./internal/api/handler/... \
./cmd/server/...
ok internal/config 1.177s
ok internal/ratelimit 0.007s
ok internal/scheduler 9.165s
ok internal/api/handler 6.245s
ok cmd/server 0.390s
$ staticcheck ./internal/ratelimit/... ./internal/scheduler/... \
./internal/config/... ./internal/api/handler/... ./cmd/server/...
(clean)
$ gofmt -l internal/ cmd/server/
(clean)
$ grep -c 'per-process, in-memory, reset-on-restart' \
docs/operator/observability.md
0 (doc smoke — the audit's verbatim phrasing is gone)
$ bash scripts/ci-guards/G-3-env-docs-drift.sh
G-3 env-docs-drift: clean.
$ bash scripts/ci-guards/complete-path-config-coverage.sh
OK — every CERTCTL_* env var (197) has at least one non-config-
package consumer.
Selector contract verified — config.Validate() rejects any value
other than ""/memory/postgres at startup with a clear error message.
Sprint 13.4 next (ARCH-H1 OpenAPI authoring batch 1) is on a
different axis; ARCH-M1 closure is complete with this commit
modulo the Sprint 13.7 audit-HTML flip + zero-floor pin.
Closes: ARCH-M1 substantive remediation. The cross-replica rate-
limit-cap-enforcement gap that the audit recommended deferring to
v3 is closed; operators with server.replicas > 1 flip
CERTCTL_RATE_LIMIT_BACKEND=postgres and get exactly-cap enforcement
across the cluster (proved by the multi-replica integration test now
gating CI).
@@ -0,0 +1,65 @@
// Copyright 2026 certctl LLC. All rights reserved.
// SPDX-License-Identifier: BUSL-1.1

package ratelimit

import (
	"database/sql"
	"fmt"
	"time"
)

// Phase 13 Sprint 13.3 (2026-05-14, architecture diligence audit
// ARCH-M1): the backend-selector factory. Wires every
// `ratelimit.NewSlidingWindowLimiter(...)` call site in
// cmd/server/main.go through here so the operator-chosen backend
// (CERTCTL_RATE_LIMIT_BACKEND={memory,postgres}) gates the limiter
// type without each call site replicating the switch.
//
// Caller-visible behavior contract: NewLimiter(backend="memory", ...)
// returns a *SlidingWindowLimiter identical to a direct
// NewSlidingWindowLimiter call. NewLimiter(backend="postgres", ...)
// returns a *PostgresSlidingWindowLimiter with the same Allow(key, now)
// signature + the same ErrRateLimited sentinel + the same maxN<=0
// disabled semantics. Sprint 13.3's "no signature change" rule is
// what makes the swap drop-in.
//
// The mapCap argument is the in-memory backend's per-instance
// key-cap (LRU-evicted under pressure). Postgres backend has no
// equivalent: the table grows until the scheduler janitor sweeps
// stale rows; mapCap is accepted + ignored for that backend so the
// factory signature stays drop-in identical to NewSlidingWindowLimiter.

// NewLimiter returns a Limiter backed by either the in-memory
// SlidingWindowLimiter (backend="memory") or the
// PostgresSlidingWindowLimiter (backend="postgres").
//
// `backend` is validated by config.Validate() at startup; any other
// value here panics. Config validation is the SoT; this is just
// defensive in case the call site somehow bypasses startup
// validation.
//
// `db` is required when backend="postgres" and ignored when
// backend="memory". The factory does not nil-check db for the
// memory branch because requiring a meaningful db handle for the
// memory path would couple every limiter call site to the database
// pool unnecessarily.
//
// `maxN <= 0` disables the limiter (both backends honor the
// opt-out: all Allow calls return nil).
func NewLimiter(backend string, db *sql.DB, maxN int, window time.Duration, mapCap int) Limiter {
	switch backend {
	case "memory":
		return NewSlidingWindowLimiter(maxN, window, mapCap)
	case "postgres":
		if db == nil {
			panic("ratelimit.NewLimiter: backend=postgres requires a non-nil *sql.DB (config.Validate should have caught this earlier)")
		}
		return NewPostgresSlidingWindowLimiter(db, maxN, window)
	default:
		// Defensive: config.Validate() rejects anything else at
		// startup. Reaching this branch implies a coding error in a
		// future call site that bypasses validation.
		panic(fmt.Sprintf("ratelimit.NewLimiter: unknown backend %q (must be memory or postgres)", backend))
	}
}
@@ -0,0 +1,71 @@
// Copyright 2026 certctl LLC. All rights reserved.
// SPDX-License-Identifier: BUSL-1.1

package ratelimit

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// Phase 13 Sprint 13.3 closure (2026-05-14, architecture diligence audit
// ARCH-M1): the scheduler-invoked janitor for the postgres-backed
// rate-limit bucket table. Sweeps rows whose updated_at is older than
// the longest configured window any caller uses. These rows can
// never be at-cap (every timestamp inside has aged past the window),
// so dropping them entirely is safe.
//
// The in-memory backend's prune-on-Allow path keeps buckets short-
// lived without a separate sweep; this file is postgres-only.

// PostgresGC drives the rate_limit_buckets sweep. Constructed from the
// same *sql.DB the limiters use; the scheduler holds it as a value
// satisfying the ratelimit.GarbageCollector interface (mirrors the
// shape of acme.GarbageCollector + sessions.GarbageCollector).
type PostgresGC struct {
	db        *sql.DB
	maxWindow time.Duration
}

// NewPostgresGC returns a janitor that sweeps rows whose updated_at
// is older than `maxWindow` ago. Pass the longest window any caller
// in the deployment configures (the EST per-principal limiter uses
// 24h today; bump if a new caller introduces a longer window).
//
// maxWindow <= 0 disables the sweep: GarbageCollect becomes a
// no-op. Operator opt-out for sketchpad / single-replica deploys
// that still want the postgres backend (rare; the memory backend is
// the better fit).
func NewPostgresGC(db *sql.DB, maxWindow time.Duration) *PostgresGC {
	return &PostgresGC{db: db, maxWindow: maxWindow}
}

// GarbageCollect deletes every rate_limit_buckets row whose
// updated_at is older than now-maxWindow. Returns the number of
// rows deleted + any error from the DELETE.
//
// Single statement, single round-trip; operates on the
// rate_limit_buckets_updated_at_idx index introduced in migration
// 000046. Idempotent: repeated calls find 0 rows.
func (g *PostgresGC) GarbageCollect(ctx context.Context) (int64, error) {
	if g.maxWindow <= 0 {
		return 0, nil
	}
	cutoff := time.Now().Add(-g.maxWindow)
	res, err := g.db.ExecContext(ctx, `
		DELETE FROM rate_limit_buckets
		WHERE updated_at < $1
	`, cutoff)
	if err != nil {
		return 0, fmt.Errorf("ratelimit-gc: delete stale buckets: %w", err)
	}
	n, err := res.RowsAffected()
	if err != nil {
		// Driver doesn't expose RowsAffected; rare. Don't fail the
		// sweep, since the delete already ran.
		return 0, nil
	}
	return n, nil
}