mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 21:21:40 +00:00
c8347d742d
Phase 13 Sprint 13.2 closure (architecture diligence audit ARCH-M1):
ships the infrastructure half of the ARCH-M1 substantive close. Adds a
postgres-backed sliding-window rate limiter that satisfies the same
interface as the in-memory primitive — cross-replica-consistent rather
than per-process. Sprint 13.3 wires the 5 call sites through a
backend selector (`CERTCTL_RATELIMIT_BACKEND={memory,postgres}`); this
commit deliberately changes ZERO call sites. The infrastructure +
migration ship as their own review window, mirroring the Phase 9
Sprint 8a/8b pattern.
Substantive close, not document-and-defer
=========================================
The audit recommended "document the per-process limit + defer the
distributed backend to v3." The operator chose Option M1-A (postgres-
backed; zero new infra) over the document-and-defer path. Postgres
is already a hard dependency for certctl; no new operator burden. The
multi-replica integration test in this commit is the falsifiable
closure proof — cap-N enforced exactly across N replicas hitting the
same key concurrently.
Signature ground-truth
======================
The Sprint 13.2 prompt template specified `Allow(key string) error` as
the signature to match. The actual repo signature has been
`Allow(key string, now time.Time) error` since the EST RFC 7030
hardening master bundle Phase 4.1 — the `now` parameter is what makes
the memory limiter testable against synthetic time without an
indirection through clock-injection. The new `Limiter` interface +
`PostgresSlidingWindowLimiter` match the actual repo signature
(`Allow(key string, now time.Time) error`) byte-for-byte. Per CLAUDE.md
"the repo is truth" — the prompt is framing, the code is ground-truth.
Files added
===========
migrations/000046_rate_limit_buckets.up.sql + .down.sql:
- rate_limit_buckets(bucket_key TEXT PRIMARY KEY, timestamps
TIMESTAMPTZ[] NOT NULL DEFAULT '{}', updated_at TIMESTAMPTZ NOT
NULL DEFAULT NOW()).
- btree index on updated_at supports the Sprint 13.3 janitor sweep.
- All statements IF NOT EXISTS / DROP IF EXISTS per CLAUDE.md
"Idempotent migrations" rule.
internal/ratelimit/limiter.go (NEW, 53 LOC):
- Defines the `Limiter` interface with `Allow(key string,
now time.Time) error`.
- Compile-time satisfaction checks for both backends.
- Doc-comment documents the prompt-vs-repo signature reconciliation
+ the Sprint 13.3 backend-selector plan + why the interface stays
minimal (Disabled/Len are non-portable cross-backend; keeping them
off the interface avoids leaking implementation detail).
internal/ratelimit/postgres_sliding_window.go (NEW, 178 LOC):
- PostgresSlidingWindowLimiter struct + NewPostgresSlidingWindowLimiter
constructor + Allow + Disabled methods.
- Algorithm: BEGIN tx → INSERT ON CONFLICT DO NOTHING (ensures the
row exists) → SELECT ... FOR UPDATE (per-key row lock acquired
across the cluster) → prune in Go via the shared pruneOlderThan
helper (single source of truth for prune semantics) → decide
rate-limited or append → UPDATE → COMMIT.
- SELECT FOR UPDATE is what arbitrates across replicas. Replicas A
and B firing simultaneous Allow("k") never race because Postgres
serializes the row-lock; the memory backend's sync.Mutex only
arbitrates within a process.
- Same `maxN <= 0 → disabled` opt-out semantics as the memory
backend.
- Empty-key short-circuit (chokepoint avoidance) matches the memory
backend.
- Uses pq.Array for TIMESTAMPTZ[] marshalling (lib/pq is the
existing project driver).
internal/ratelimit/equivalence_test.go (NEW, 304 LOC):
- Backend-equivalence suite that runs the same scenario set against
both backends via the `Limiter` interface. 7 scenarios per
backend: AllowsUpToCap, DistinctKeysIndependent, WindowExpiry,
DisabledBypass, NegativeCapDisabled, EmptyKeyShortCircuits,
ConcurrentRaceFree.
- Memory half: TestSlidingWindowLimiter_Equivalence_Memory — runs
on every `go test ./...`.
- Postgres half: TestSlidingWindowLimiter_Equivalence_Postgres —
gated by `testing.Short()`; runs only when -short is omitted, so
`go test -race -short ./...` keeps fast.
- Schema-per-test isolation via testcontainers-go (mirrors the
pattern in internal/repository/postgres/testutil_test.go: setup
one container, fresh schema per subtest, search_path-pinned DSN).
- Memory equivalence half re-verifies the same behaviors pinned in
the pre-existing sliding_window_test.go but through the interface
— catches drift if SlidingWindowLimiter.Allow ever changes shape.
internal/integration/ratelimit_multi_replica_test.go (NEW, 159 LOC):
- The falsifiable ARCH-M1 closure proof, gated by //go:build
integration matching the rest of internal/integration/.
- Scenario: 1 postgres container shared across N=3 independent
*PostgresSlidingWindowLimiter instances (each replica's process
has its own *sql.DB pool to the same database, just like a real
HA deployment). 100 concurrent Allow("test-key") calls round-
robin across the 3 limiters via sync.WaitGroup. Cap = 10,
window = 1m, shared now-timestamp so the scenario is
deterministic.
- Assert: exactly 10 succeed + 90 return ErrRateLimited. If the
cross-replica row lock weren't arbitrating, each replica would
independently let through ~3-4 requests (10/3), giving 12-15
successes. The hard-pass on exactly-10 is what makes ARCH-M1
substantive.
What did NOT change
===================
- internal/ratelimit/sliding_window.go (the memory backend) is
byte-identical to its pre-Sprint-13.2 state. Same Mutex, same
Allow signature, same Len/Disabled/pruneOlderThan/evictOldestLocked.
Compile-time check in limiter.go pins that the memory backend
still satisfies the new interface.
- No call site in cmd/server, internal/api/handler, internal/service
changed. Sprint 13.3 owns the 5-site migration + the
CERTCTL_RATELIMIT_BACKEND env-var selector.
- No new operator dependency. Postgres is already required for
certctl-server to boot. Redis (Option M1-B) was declined by the
operator and is not introduced here.
Verification
============
$ ls migrations/000046_rate_limit_buckets.up.sql migrations/000046_rate_limit_buckets.down.sql
$ ls internal/ratelimit/limiter.go internal/ratelimit/postgres_sliding_window.go
$ grep -nE 'sync\.Mutex|sync\.RWMutex' internal/ratelimit/sliding_window.go
30:// by sync.Mutex; per-key slices mutated only while the mutex is
56: mu sync.Mutex
(memory backend untouched)
$ gofmt -l internal/ratelimit/ internal/integration/ → clean
$ go vet ./internal/ratelimit/... → clean
$ go vet -tags=integration ./internal/integration/... → clean
$ staticcheck ./internal/ratelimit/... → clean
$ go build ./... → clean
$ go build -tags=integration ./internal/integration/...→ clean
$ go test -race -short -count=1 ./internal/ratelimit/...
ok github.com/certctl-io/certctl/internal/ratelimit 1.028s
(memory equivalence + sliding_window_test.go both pass; postgres
equivalence skipped under -short as designed)
$ go doc ./internal/ratelimit/
type Limiter interface{ ... }
type PostgresSlidingWindowLimiter struct{ ... }
func NewPostgresSlidingWindowLimiter(db *sql.DB, maxN int,
window time.Duration) *PostgresSlidingWindowLimiter
type SlidingWindowLimiter struct{ ... }
func NewSlidingWindowLimiter(maxN int, window time.Duration,
mapCap int) *SlidingWindowLimiter
var ErrRateLimited = ...
(public surface matches the Sprint 13.2 prompt's required diff)
Sandbox note: the multi-replica integration test + the postgres
equivalence half run under testcontainers-go which requires docker-
in-docker. The CI integration job exercises both; local CI-equivalent
verification was build + vet + staticcheck + memory equivalence (the
sandbox /sessions partition is full so spinning a postgres container
locally isn't viable in this session). The Sprint 13.3 commit will
re-verify against the live integration job.
Next: Sprint 13.3 wires every call site through
ratelimit.NewLimiter(cfg.Server.RateLimitBackend, db, ...) +
introduces the scheduler janitor loop + rewrites the
docs/operator/observability.md "per-process" paragraph to describe
the configurable backend.
Refs: ARCH-M1 (HA / scale — rate limits per-process), Phase 13
Sprint 13.2.