mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 16:21:30 +00:00
c8347d742d
Phase 13 Sprint 13.2 closure (architecture diligence audit ARCH-M1):
ships the infrastructure half of the ARCH-M1 substantive close. Adds a
postgres-backed sliding-window rate limiter that satisfies the same
interface as the in-memory primitive — cross-replica-consistent rather
than per-process. Sprint 13.3 wires the 5 call sites through a
backend selector (`CERTCTL_RATELIMIT_BACKEND={memory,postgres}`); this
commit deliberately changes ZERO call sites. The infrastructure +
migration ship as their own review window, mirroring the Phase 9
Sprint 8a/8b pattern.
Substantive close, not document-and-defer
=========================================
The audit recommended "document the per-process limit + defer the
distributed backend to v3." The operator chose Option M1-A (postgres-
backed; zero new infra) over the document-and-defer path. Postgres
is already a hard dependency for certctl; no new operator burden. The
multi-replica integration test in this commit is the falsifiable
closure proof — cap-N enforced exactly across N replicas hitting the
same key concurrently.
Signature ground-truth
======================
The Sprint 13.2 prompt template specified `Allow(key string) error` as
the signature to match. The actual repo signature has been
`Allow(key string, now time.Time) error` since the EST RFC 7030
hardening master bundle Phase 4.1 — the `now` parameter is what makes
the memory limiter testable against synthetic time without an
indirection through clock-injection. The new `Limiter` interface +
`PostgresSlidingWindowLimiter` match the actual repo signature
(`Allow(key string, now time.Time) error`) byte-for-byte. Per CLAUDE.md
"the repo is truth" — the prompt is framing, the code is ground-truth.
Files added
===========
migrations/000046_rate_limit_buckets.up.sql + .down.sql:
- rate_limit_buckets(bucket_key TEXT PRIMARY KEY, timestamps
TIMESTAMPTZ[] NOT NULL DEFAULT '{}', updated_at TIMESTAMPTZ NOT
NULL DEFAULT NOW()).
- btree index on updated_at supports the Sprint 13.3 janitor sweep.
- All statements IF NOT EXISTS / DROP IF EXISTS per CLAUDE.md
"Idempotent migrations" rule.
internal/ratelimit/limiter.go (NEW, 53 LOC):
- Defines the `Limiter` interface with `Allow(key string,
now time.Time) error`.
- Compile-time satisfaction checks for both backends.
- Doc-comment documents the prompt-vs-repo signature reconciliation
+ the Sprint 13.3 backend-selector plan + why the interface stays
minimal (Disabled/Len are non-portable cross-backend; keeping them
off the interface avoids leaking implementation detail).
internal/ratelimit/postgres_sliding_window.go (NEW, 178 LOC):
- PostgresSlidingWindowLimiter struct + NewPostgresSlidingWindowLimiter
constructor + Allow + Disabled methods.
- Algorithm: BEGIN tx → INSERT ON CONFLICT DO NOTHING (ensures the
row exists) → SELECT ... FOR UPDATE (per-key row lock acquired
across the cluster) → prune in Go via the shared pruneOlderThan
helper (single source of truth for prune semantics) → decide
rate-limited or append → UPDATE → COMMIT.
- SELECT FOR UPDATE is what arbitrates across replicas. Replicas A
and B firing simultaneous Allow("k") never race because Postgres
serializes the row-lock; the memory backend's sync.Mutex only
arbitrates within a process.
- Same `maxN <= 0 → disabled` opt-out semantics as the memory
backend.
- Empty-key short-circuit (chokepoint avoidance) matches the memory
backend.
- Uses pq.Array for TIMESTAMPTZ[] marshalling (lib/pq is the
existing project driver).
internal/ratelimit/equivalence_test.go (NEW, 304 LOC):
- Backend-equivalence suite that runs the same scenario set against
both backends via the `Limiter` interface. 7 scenarios per
backend: AllowsUpToCap, DistinctKeysIndependent, WindowExpiry,
DisabledBypass, NegativeCapDisabled, EmptyKeyShortCircuits,
ConcurrentRaceFree.
- Memory half: TestSlidingWindowLimiter_Equivalence_Memory — runs
on every `go test ./...`.
- Postgres half: TestSlidingWindowLimiter_Equivalence_Postgres —
gated by `testing.Short()`; runs only when -short is omitted, so
`go test -race -short ./...` keeps fast.
- Schema-per-test isolation via testcontainers-go (mirrors the
pattern in internal/repository/postgres/testutil_test.go: setup
one container, fresh schema per subtest, search_path-pinned DSN).
- Memory equivalence half re-verifies the same behaviors pinned in
the pre-existing sliding_window_test.go but through the interface
— catches drift if SlidingWindowLimiter.Allow ever changes shape.
internal/integration/ratelimit_multi_replica_test.go (NEW, 159 LOC):
- The falsifiable ARCH-M1 closure proof, gated by //go:build
integration matching the rest of internal/integration/.
- Scenario: 1 postgres container shared across N=3 independent
*PostgresSlidingWindowLimiter instances (each replica's process
has its own *sql.DB pool to the same database, just like a real
HA deployment). 100 concurrent Allow("test-key") calls round-
robin across the 3 limiters via sync.WaitGroup. Cap = 10,
window = 1m, shared now-timestamp so the scenario is
deterministic.
- Assert: exactly 10 succeed + 90 return ErrRateLimited. If the
cross-replica row lock weren't arbitrating, each replica would
independently let through ~3-4 requests (10/3), giving 12-15
successes. The hard-pass on exactly-10 is what makes ARCH-M1
substantive.
What did NOT change
===================
- internal/ratelimit/sliding_window.go (the memory backend) is
byte-identical to its pre-Sprint-13.2 state. Same Mutex, same
Allow signature, same Len/Disabled/pruneOlderThan/evictOldestLocked.
Compile-time check in limiter.go pins that the memory backend
still satisfies the new interface.
- No call site in cmd/server, internal/api/handler, internal/service
changed. Sprint 13.3 owns the 5-site migration + the
CERTCTL_RATELIMIT_BACKEND env-var selector.
- No new operator dependency. Postgres is already required for
certctl-server to boot. Redis (Option M1-B) was declined by the
operator and is not introduced here.
Verification
============
$ ls migrations/000046_rate_limit_buckets.up.sql migrations/000046_rate_limit_buckets.down.sql
$ ls internal/ratelimit/limiter.go internal/ratelimit/postgres_sliding_window.go
$ grep -nE 'sync\.Mutex|sync\.RWMutex' internal/ratelimit/sliding_window.go
30:// by sync.Mutex; per-key slices mutated only while the mutex is
56: mu sync.Mutex
(memory backend untouched)
$ gofmt -l internal/ratelimit/ internal/integration/ → clean
$ go vet ./internal/ratelimit/... → clean
$ go vet -tags=integration ./internal/integration/... → clean
$ staticcheck ./internal/ratelimit/... → clean
$ go build ./... → clean
$ go build -tags=integration ./internal/integration/...→ clean
$ go test -race -short -count=1 ./internal/ratelimit/...
ok github.com/certctl-io/certctl/internal/ratelimit 1.028s
(memory equivalence + sliding_window_test.go both pass; postgres
equivalence skipped under -short as designed)
$ go doc ./internal/ratelimit/
type Limiter interface{ ... }
type PostgresSlidingWindowLimiter struct{ ... }
func NewPostgresSlidingWindowLimiter(db *sql.DB, maxN int,
window time.Duration) *PostgresSlidingWindowLimiter
type SlidingWindowLimiter struct{ ... }
func NewSlidingWindowLimiter(maxN int, window time.Duration,
mapCap int) *SlidingWindowLimiter
var ErrRateLimited = ...
(public surface matches the Sprint 13.2 prompt's required diff)
Sandbox note: the multi-replica integration test + the postgres
equivalence half run under testcontainers-go which requires docker-
in-docker. The CI integration job exercises both; local CI-equivalent
verification was build + vet + staticcheck + memory equivalence (the
sandbox /sessions partition is full so spinning a postgres container
locally isn't viable in this session). The Sprint 13.3 commit will
re-verify against the live integration job.
Next: Sprint 13.3 wires every call site through
ratelimit.NewLimiter(cfg.Server.RateLimitBackend, db, ...) +
introduces the scheduler janitor loop + rewrites the
docs/operator/observability.md "per-process" paragraph to describe
the configurable backend.
Refs: ARCH-M1 (HA / scale — rate limits per-process), Phase 13
Sprint 13.2.
196 lines
5.9 KiB
Go
196 lines
5.9 KiB
Go
// Copyright 2026 certctl LLC. All rights reserved.
|
|
// SPDX-License-Identifier: BUSL-1.1
|
|
|
|
//go:build integration
|
|
|
|
package integration
|
|
|
|
import (
|
|
"context"
|
|
"database/sql"
|
|
"errors"
|
|
"fmt"
|
|
"os"
|
|
"path/filepath"
|
|
"runtime"
|
|
"sync"
|
|
"sync/atomic"
|
|
"testing"
|
|
"time"
|
|
|
|
_ "github.com/lib/pq"
|
|
"github.com/testcontainers/testcontainers-go"
|
|
"github.com/testcontainers/testcontainers-go/wait"
|
|
|
|
"github.com/certctl-io/certctl/internal/ratelimit"
|
|
)
|
|
|
|
// Phase 13 Sprint 13.2 closure (2026-05-14, architecture diligence audit
|
|
// ARCH-M1) — the falsifiable closure proof for cross-replica rate-limit
|
|
// consistency.
|
|
//
|
|
// Scenario:
|
|
// - ONE postgres container (representing the shared backend).
|
|
// - N=3 independent *PostgresSlidingWindowLimiter instances pointing
|
|
// at it (representing 3 server replicas — each replica's process
|
|
// has its own constructed limiter, but they all share the same
|
|
// database state).
|
|
// - 100 concurrent Allow("test-key") calls spread across the 3
|
|
// limiters via sync.WaitGroup.
|
|
// - Assert: exactly 10 succeed + 90 return ErrRateLimited.
|
|
//
|
|
// If the postgres backend's SELECT FOR UPDATE serialization weren't
|
|
// arbitrating across the 3 limiters, more than 10 calls would be
|
|
// allowed (each replica would independently let through 10/3 ≈ 4
|
|
// requests, giving ~12-15 successes depending on scheduling). The
|
|
// hard-pass on exactly-10 is what makes ARCH-M1 closure substantive
|
|
// rather than wishful.
|
|
//
|
|
// Gated by //go:build integration matching the rest of
|
|
// internal/integration/. Sprint 13.3 promotes this test to a
|
|
// required CI status check.
|
|
|
|
func TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas(t *testing.T) {
|
|
const (
|
|
replicas = 3
|
|
cap = 10
|
|
window = 1 * time.Minute
|
|
concurrentReq = 100
|
|
key = "test-key"
|
|
)
|
|
|
|
ctx := context.Background()
|
|
|
|
// Boot a shared postgres container.
|
|
container, dsn := startPostgresContainer(ctx, t)
|
|
t.Cleanup(func() { _ = container.Terminate(context.Background()) })
|
|
|
|
// Each "replica" gets its own *sql.DB pool — same database, different
|
|
// connection pool — matching how N server processes would each open
|
|
// their own pool to the same control-plane database.
|
|
dbs := make([]*sql.DB, replicas)
|
|
for i := 0; i < replicas; i++ {
|
|
db, err := sql.Open("postgres", dsn)
|
|
if err != nil {
|
|
t.Fatalf("open db (replica %d): %v", i, err)
|
|
}
|
|
db.SetMaxOpenConns(8)
|
|
if err := db.Ping(); err != nil {
|
|
t.Fatalf("ping (replica %d): %v", i, err)
|
|
}
|
|
t.Cleanup(func() { db.Close() })
|
|
dbs[i] = db
|
|
}
|
|
|
|
// Apply the rate_limit_buckets migration via dbs[0]. All replicas
|
|
// see the same schema since they share the same database.
|
|
migPath := findMigrationFromHere("000046_rate_limit_buckets.up.sql")
|
|
body, err := os.ReadFile(migPath)
|
|
if err != nil {
|
|
t.Fatalf("read migration: %v", err)
|
|
}
|
|
if _, err := dbs[0].ExecContext(ctx, string(body)); err != nil {
|
|
t.Fatalf("apply migration: %v", err)
|
|
}
|
|
|
|
// Instantiate one limiter per replica.
|
|
limiters := make([]*ratelimit.PostgresSlidingWindowLimiter, replicas)
|
|
for i := 0; i < replicas; i++ {
|
|
limiters[i] = ratelimit.NewPostgresSlidingWindowLimiter(dbs[i], cap, window)
|
|
}
|
|
|
|
// Fire concurrentReq parallel Allow calls, round-robining across the
|
|
// replicas. Each call uses the SAME key + a SHARED `now` so the
|
|
// scenario is deterministic. The cross-replica row lock is what
|
|
// enforces the cap globally.
|
|
var (
|
|
allowed int64
|
|
denied int64
|
|
wg sync.WaitGroup
|
|
)
|
|
now := time.Now()
|
|
for i := 0; i < concurrentReq; i++ {
|
|
wg.Add(1)
|
|
go func(idx int) {
|
|
defer wg.Done()
|
|
l := limiters[idx%replicas]
|
|
err := l.Allow(key, now)
|
|
if err == nil {
|
|
atomic.AddInt64(&allowed, 1)
|
|
} else if errors.Is(err, ratelimit.ErrRateLimited) {
|
|
atomic.AddInt64(&denied, 1)
|
|
} else {
|
|
t.Errorf("unexpected error from Allow: %v", err)
|
|
}
|
|
}(i)
|
|
}
|
|
wg.Wait()
|
|
|
|
gotAllowed := atomic.LoadInt64(&allowed)
|
|
gotDenied := atomic.LoadInt64(&denied)
|
|
|
|
t.Logf("replicas=%d cap=%d concurrent=%d → allowed=%d denied=%d",
|
|
replicas, cap, concurrentReq, gotAllowed, gotDenied)
|
|
|
|
if gotAllowed != int64(cap) {
|
|
t.Errorf("allowed = %d, want exactly %d (cross-replica row lock should serialize Allow calls so exactly cap succeed)",
|
|
gotAllowed, cap)
|
|
}
|
|
if gotDenied != int64(concurrentReq-cap) {
|
|
t.Errorf("denied = %d, want %d (concurrentReq - cap)", gotDenied, concurrentReq-cap)
|
|
}
|
|
}
|
|
|
|
// ----------------------------------------------------------------
|
|
// Local testcontainers harness. Kept in-file because the rest of
|
|
// internal/integration/ uses HTTP-against-running-server smoke tests
|
|
// against a docker-compose stack — different shape from ours.
|
|
// ----------------------------------------------------------------
|
|
|
|
func startPostgresContainer(ctx context.Context, t *testing.T) (testcontainers.Container, string) {
|
|
t.Helper()
|
|
|
|
req := testcontainers.ContainerRequest{
|
|
Image: "postgres:16-alpine",
|
|
ExposedPorts: []string{"5432/tcp"},
|
|
Env: map[string]string{
|
|
"POSTGRES_DB": "certctl_test",
|
|
"POSTGRES_USER": "certctl",
|
|
"POSTGRES_PASSWORD": "certctl",
|
|
},
|
|
WaitingFor: wait.ForLog("database system is ready to accept connections").WithOccurrence(2),
|
|
}
|
|
container, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
|
|
ContainerRequest: req,
|
|
Started: true,
|
|
})
|
|
if err != nil {
|
|
t.Fatalf("start postgres container: %v", err)
|
|
}
|
|
|
|
host, err := container.Host(ctx)
|
|
if err != nil {
|
|
t.Fatalf("container host: %v", err)
|
|
}
|
|
port, err := container.MappedPort(ctx, "5432")
|
|
if err != nil {
|
|
t.Fatalf("container port: %v", err)
|
|
}
|
|
dsn := fmt.Sprintf("postgres://certctl:certctl@%s:%s/certctl_test?sslmode=disable",
|
|
host, port.Port())
|
|
return container, dsn
|
|
}
|
|
|
|
func findMigrationFromHere(filename string) string {
|
|
_, here, _, _ := runtime.Caller(0)
|
|
dir := filepath.Dir(here)
|
|
for i := 0; i < 6; i++ {
|
|
candidate := filepath.Join(dir, "migrations", filename)
|
|
if _, err := os.Stat(candidate); err == nil {
|
|
return candidate
|
|
}
|
|
dir = filepath.Dir(dir)
|
|
}
|
|
return ""
|
|
}
|