acme-server: cert-manager integration test + production hardening (Phase 5/7)

Closes the production-readiness loop on the ACME surface. After this
commit, certctl ships per-account rate limits + a GC sweeper for expired
ACME state + a kind-driven cert-manager 1.15 integration test + a
lego-driven RFC conformance harness + a k6 loadtest scenario for the
unauthenticated ACME path.

Architecture:

- Rate limits live in-memory + per-replica. Restart wipes the counters;
  orders/hour caps are eventual-consistency anyway. A 3-replica
  certctl-server fleet behind an LB effectively has 3x the configured
  throughput per account; persistent rate limiting is a follow-up if
  production telemetry shows abuse patterns we can't catch in a single
  restart cycle. Per-key + per-action isolation: ActionNewOrder/acc-1,
  ActionKeyChange/acc-1, and ActionChallengeRespond/<challenge-id> are
  independent buckets.
- GC loop follows the existing scheduler-loop pattern (atomic.Bool +
  sync.WaitGroup; see crlGenerationLoop for shape). Three independent
  SQL sweeps per tick (DELETE expired nonces; UPDATE pending authzs
  whose expires_at < now() to expired; UPDATE pending/ready/processing
  orders whose expires_at < now() to invalid). Each sweep is a single
  statement; failures are logged-and-continued so a failing nonces sweep
  doesn't block authzs. Per-sweep 1m timeout bounds a stuck Postgres.
- cert-manager integration test is gated on KIND_AVAILABLE so CI skips
  it cleanly (kind is too heavy for per-PR). Operators run locally via
  'make acme-cert-manager-test'; the harness brings up a fresh cluster
  each run + tears it down on Cleanup.
- lego conformance harness drives a real ACME client through register →
  run → cert-PEM-landed against a hermetic certctl stack. Catches
  RFC-shape regressions third-party clients would hit before they ship.
- k6 ACME-flow scenario hammers the unauthenticated surface (directory +
  new-nonce + ARI synthetic-id) at 100 VUs × 5m. JWS-signed flows are
  out of scope for k6 (no JWS support); they're covered by the lego
  harness above.

What ships:

- internal/api/acme/ratelimit.go (+ ratelimit_test.go: 7 cases:
  disable-when-perHour-zero, capacity, per-key isolation, per-action
  isolation, refill-over-time, RetryAfter, concurrent-access with -race
  + 200 goroutines × 200 calls).
- internal/repository/postgres/acme.go: 4 new methods
  (CountActiveOrdersByAccount + GCExpiredNonces +
  GCExpireAuthorizations + GCInvalidateExpiredOrders). Each a single SQL
  statement.
- internal/service/acme.go: SetRateLimiter + GarbageCollect + rate-limit
  gates at 3 entry points (CreateOrder + RotateAccountKey +
  RespondToChallenge) + concurrent-orders gate at CreateOrder. 2 new
  sentinels (ErrACMERateLimited, ErrACMEConcurrentOrdersExceeded); 5 new
  GC metrics (gc_runs / gc_run_failures / gc_nonces_reaped /
  gc_authzs_expired / gc_orders_invalidated).
- internal/scheduler/scheduler.go: ACMEGarbageCollector interface +
  acmeGCRunning atomic.Bool + acmeGCInterval + 2 setters
  (SetACMEGarbageCollector + SetACMEGCInterval) + acmeGCLoop following
  the crlGenerationLoop shape.
- internal/api/handler/acme.go: writeServiceError gains rateLimited
  (429 + RFC 8555 §6.7) + concurrent-orders-exceeded mappings.
- internal/config/config.go: 5 new env vars:
    CERTCTL_ACME_SERVER_RATE_LIMIT_ORDERS_PER_HOUR=100
    CERTCTL_ACME_SERVER_RATE_LIMIT_CONCURRENT_ORDERS=5
    CERTCTL_ACME_SERVER_RATE_LIMIT_KEY_CHANGE_PER_HOUR=5
    CERTCTL_ACME_SERVER_RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR=60
    CERTCTL_ACME_SERVER_GC_INTERVAL=1m
- cmd/server/main.go: NewRateLimiter() + SetRateLimiter() at startup;
  conditional SetACMEGarbageCollector(acmeService) +
  SetACMEGCInterval(cfg.ACMEServer.GCInterval) when Enabled +
  GCInterval > 0.
- deploy/test/acme-integration/: kind-config.yaml +
  cert-manager-install.sh + clusterissuer-trust-authenticated.yaml +
  clusterissuer-challenge.yaml + certificate-test.yaml +
  conformance-lego.sh + certmanager_test.go (//go:build integration +
  KIND_AVAILABLE gate).
- deploy/test/loadtest/k6/acme_flow.js + README ACME-flows section.
- Makefile: 2 new PHONY targets (acme-cert-manager-test +
  acme-rfc-conformance-test).
- docs/acme-server.md: status flipped to Phase 5; Configuration table
  grows 5 rows; new 'Phase 5 — operational guidance' section explaining
  rate-limit math + GC sweeper semantics + cert-manager integration +
  lego conformance + k6 baseline.

Tests:

- 'go vet ./...' clean across the repo.
- 'go test -short -count=1 ./internal/...' green across every affected
  package (service / acme / handler / scheduler / repo / config).
- 'go vet -tags=integration ./deploy/test/acme-integration/' clean (the
  integration test compiles cleanly with the build tag).
- The kind/cert-manager harness is gated behind KIND_AVAILABLE so CI
  skips by default; operators run locally via
  'make acme-cert-manager-test'.

Engineering history: cowork/WORKSPACE-CHANGELOG.md 'ACME-Server-5'.
2026-06-07 15:01:32 +00:00 · 2026-05-03 19:42:03 +00:00
parent 9bfbac0f97
commit bee47f0318
20 changed files with 1341 additions and 21 deletions
@@ -0,0 +1,166 @@
+// Copyright (c) certctl
+// SPDX-License-Identifier: BSL-1.1
+
+package acme
+
+import (
+	"errors"
+	"sync"
+	"time"
+)
+
+// Phase 5 — per-account rolling-hour rate limiter for ACME operations.
+//
+// Architecture:
+//   - In-memory token-bucket per (key, action). Restart wipes the
+//     buckets; orders/hour caps are eventual-consistency so this is
+//     acceptable. Persistent rate limiting is a follow-up if production
+//     telemetry shows abuse patterns we can't catch in a single restart
+//     cycle (master prompt criterion #11 explicitly accepts this).
+//   - Tokens-per-hour math: bucket capacity = perHour, refill rate =
+//     perHour / 3600 tokens/sec. A fresh bucket starts full; an over-
+//     limit caller drains it then has to wait for replenishment.
+//   - Key shape is action-specific: orders use accountID; key-rollover
+//     uses accountID; challenge-respond uses challengeID (so a flood
+//     against one challenge doesn't burn the whole account's budget).
+//
+// Concurrency: the outer map is RWMutex-guarded for create-on-demand;
+// per-bucket allow() takes a tiny per-bucket Mutex. Mirrors the
+// existing internal/api/middleware/middleware.go::keyedRateLimiter
+// pattern (different scope, same shape).
+
+// RateLimiter is the per-action token-bucket pool. Construct with
+// NewRateLimiter(); pass a single instance into ACMEService via
+// SetRateLimiter so all entry points share the same buckets.
+type RateLimiter struct {
+	mu      sync.RWMutex
+	buckets map[string]*rlBucket // keyed by "<action>|<keyID>"
+	clock   func() time.Time     // injectable for tests
+}
+
+// NewRateLimiter returns an empty RateLimiter. Buckets are created on
+// first reference, so a fresh limiter does no work until traffic
+// arrives.
+func NewRateLimiter() *RateLimiter {
+	return &RateLimiter{
+		buckets: make(map[string]*rlBucket),
+		clock:   time.Now,
+	}
+}
+
+// SetClock replaces the clock for tests. Production callers leave it
+// pointing at time.Now (the constructor default).
+func (r *RateLimiter) SetClock(now func() time.Time) {
+	if now != nil {
+		r.clock = now
+	}
+}
+
+// Allow returns true when the (action, keyID) bucket has at least one
+// token available — and consumes that token. perHour=0 disables the
+// limit (always true). Negative perHour is treated as 0.
+//
+// The first call on a fresh bucket consumes one token and returns
+// true. Once the bucket is drained, further calls return false until
+// enough time has elapsed to refill at least one whole token.
+func (r *RateLimiter) Allow(action, keyID string, perHour int) bool {
+	if perHour <= 0 {
+		return true
+	}
+	bucketKey := action + "|" + keyID
+	r.mu.RLock()
+	b, ok := r.buckets[bucketKey]
+	r.mu.RUnlock()
+	if !ok {
+		r.mu.Lock()
+		b, ok = r.buckets[bucketKey]
+		if !ok {
+			b = &rlBucket{
+				capacity:   float64(perHour),
+				refillRate: float64(perHour) / 3600.0, // tokens/sec
+				tokens:     float64(perHour),
+				lastRefill: r.clock(),
+			}
+			r.buckets[bucketKey] = b
+		}
+		r.mu.Unlock()
+	}
+	return b.allow(r.clock)
+}
+
+// RetryAfter returns the duration the caller should wait before the
+// (action, keyID) bucket has at least one token again. Returns 0 when
+// at least one token is currently available. Used by the handler to
+// emit a Retry-After header on rateLimited responses.
+func (r *RateLimiter) RetryAfter(action, keyID string, perHour int) time.Duration {
+	if perHour <= 0 {
+		return 0
+	}
+	bucketKey := action + "|" + keyID
+	r.mu.RLock()
+	b, ok := r.buckets[bucketKey]
+	r.mu.RUnlock()
+	if !ok {
+		return 0
+	}
+	b.mu.Lock()
+	defer b.mu.Unlock()
+	if b.tokens >= 1 {
+		return 0
+	}
+	missing := 1 - b.tokens
+	if b.refillRate <= 0 {
+		// Shouldn't happen (Allow rejects perHour<=0 before bucket
+		// creation), but a divide-by-zero here would panic.
+		return time.Hour
+	}
+	secs := missing / b.refillRate
+	return time.Duration(secs * float64(time.Second))
+}
+
+// rlBucket is the per-(action, keyID) token bucket. Mirrors the shape
+// of internal/api/middleware/middleware.go::tokenBucket but with a
+// per-hour-shaped refill instead of per-second.
+type rlBucket struct {
+	mu         sync.Mutex
+	capacity   float64
+	refillRate float64 // tokens per second
+	tokens     float64
+	lastRefill time.Time
+}
+
+func (b *rlBucket) allow(clock func() time.Time) bool {
+	b.mu.Lock()
+	defer b.mu.Unlock()
+
+	now := clock()
+	// Sub uses the monotonic reading when both Times carry one.
+	elapsed := now.Sub(b.lastRefill).Seconds()
+	if elapsed > 0 {
+		b.tokens += elapsed * b.refillRate
+		if b.tokens > b.capacity {
+			b.tokens = b.capacity
+		}
+		b.lastRefill = now
+	}
+	if b.tokens < 1 {
+		return false
+	}
+	b.tokens--
+	return true
+}
+
+// Action constants: keep one source of truth for the bucket-key
+// `<action>|...` prefix. These are string consts (not iota ints) so
+// the values stay stable across processes if a follow-up adds
+// shared-state rate-limiting.
+const (
+	ActionNewOrder         = "new_order"
+	ActionKeyChange        = "key_change"
+	ActionChallengeRespond = "challenge_respond"
+)
+
+// ErrRateLimited is the sentinel service-layer entry points return on
+// a hit. Handler maps to RFC 7807 + RFC 8555 §6.7
+// `urn:ietf:params:acme:error:rateLimited` with Retry-After.
+var ErrRateLimited = errors.New("acme: rate limit exceeded")
@@ -0,0 +1,159 @@
+// Copyright (c) certctl
+// SPDX-License-Identifier: BSL-1.1
+
+package acme
+
+import (
+	"sync"
+	"testing"
+	"time"
+)
+
+// Phase 5 — RateLimiter unit tests.
+
+func TestRateLimiter_DisabledWhenPerHourZero(t *testing.T) {
+	r := NewRateLimiter()
+	for i := 0; i < 10000; i++ {
+		if !r.Allow(ActionNewOrder, "acc-1", 0) {
+			t.Fatalf("Allow returned false on call %d with perHour=0", i)
+		}
+	}
+}
+
+func TestRateLimiter_DisabledWhenPerHourNegative(t *testing.T) {
+	r := NewRateLimiter()
+	if !r.Allow(ActionNewOrder, "acc-1", -5) {
+		t.Errorf("Allow returned false with perHour=-5; expected always-allow")
+	}
+}
+
+func TestRateLimiter_BucketCapacity(t *testing.T) {
+	// Frozen clock: a fresh bucket has perHour tokens. Drain exactly
+	// that many; the next call must return false.
+	now := time.Date(2026, 5, 3, 12, 0, 0, 0, time.UTC)
+	r := NewRateLimiter()
+	r.SetClock(func() time.Time { return now })
+
+	for i := 0; i < 100; i++ {
+		if !r.Allow(ActionNewOrder, "acc-1", 100) {
+			t.Fatalf("Allow returned false on call %d (within capacity)", i)
+		}
+	}
+	if r.Allow(ActionNewOrder, "acc-1", 100) {
+		t.Errorf("Allow returned true on the 101st call; expected limit hit")
+	}
+}
+
+func TestRateLimiter_PerKeyIsolation(t *testing.T) {
+	// Frozen clock — drain acc-1 to zero, then acc-2 should still have
+	// a full bucket (separate key).
+	now := time.Date(2026, 5, 3, 12, 0, 0, 0, time.UTC)
+	r := NewRateLimiter()
+	r.SetClock(func() time.Time { return now })
+
+	for i := 0; i < 100; i++ {
+		_ = r.Allow(ActionNewOrder, "acc-1", 100)
+	}
+	if r.Allow(ActionNewOrder, "acc-1", 100) {
+		t.Errorf("acc-1 should be rate-limited")
+	}
+	if !r.Allow(ActionNewOrder, "acc-2", 100) {
+		t.Errorf("acc-2 should be unaffected by acc-1's bucket; expected allow")
+	}
+}
+
+func TestRateLimiter_PerActionIsolation(t *testing.T) {
+	// Same key but different actions get different buckets.
+	now := time.Date(2026, 5, 3, 12, 0, 0, 0, time.UTC)
+	r := NewRateLimiter()
+	r.SetClock(func() time.Time { return now })
+
+	for i := 0; i < 5; i++ {
+		_ = r.Allow(ActionKeyChange, "acc-1", 5)
+	}
+	if r.Allow(ActionKeyChange, "acc-1", 5) {
+		t.Errorf("ActionKeyChange should be rate-limited")
+	}
+	// ActionNewOrder for the same key has its own, still-full bucket.
+	if !r.Allow(ActionNewOrder, "acc-1", 100) {
+		t.Errorf("ActionNewOrder for same key should be allowed (different bucket)")
+	}
+}
+
+func TestRateLimiter_RefillOverTime(t *testing.T) {
+	// Drain bucket; advance the clock; expect tokens replenished.
+	current := time.Date(2026, 5, 3, 12, 0, 0, 0, time.UTC)
+	r := NewRateLimiter()
+	r.SetClock(func() time.Time { return current })
+
+	for i := 0; i < 100; i++ {
+		_ = r.Allow(ActionNewOrder, "acc-1", 100)
+	}
+	if r.Allow(ActionNewOrder, "acc-1", 100) {
+		t.Fatalf("expected limit hit after draining bucket")
+	}
+	// Advance by 36 seconds: at 100/hour = 100/3600 tokens/sec ≈
+	// 0.0278/sec. 36 * 0.0278 = 1.00 tokens — exactly enough for 1
+	// more call.
+	current = current.Add(36 * time.Second)
+	if !r.Allow(ActionNewOrder, "acc-1", 100) {
+		t.Errorf("Allow returned false after 36s elapsed; expected ≥1 token replenished")
+	}
+}
+
+func TestRateLimiter_RetryAfter(t *testing.T) {
+	now := time.Date(2026, 5, 3, 12, 0, 0, 0, time.UTC)
+	r := NewRateLimiter()
+	r.SetClock(func() time.Time { return now })
+
+	// Drain to zero.
+	for i := 0; i < 100; i++ {
+		_ = r.Allow(ActionNewOrder, "acc-1", 100)
+	}
+	d := r.RetryAfter(ActionNewOrder, "acc-1", 100)
+	// 1 token at 100/hour = 36 seconds.
+	if d < 35*time.Second || d > 37*time.Second {
+		t.Errorf("RetryAfter = %v, expected ~36s", d)
+	}
+	// A key with no bucket yet was never limited: RetryAfter returns 0.
+	if zero := r.RetryAfter(ActionNewOrder, "acc-fresh", 100); zero != 0 {
+		t.Errorf("RetryAfter for fresh bucket = %v, expected 0", zero)
+	}
+}
+
+func TestRateLimiter_ConcurrentAccess(t *testing.T) {
+	// Hammer 200 goroutines × 200 calls each = 40000 calls against a
+	// 1000-token bucket; assert no panic, no data race (run with -race),
+	// and that exactly 1000 calls succeeded (frozen clock, no refill).
+	now := time.Date(2026, 5, 3, 12, 0, 0, 0, time.UTC)
+	r := NewRateLimiter()
+	r.SetClock(func() time.Time { return now })
+
+	var (
+		wg      sync.WaitGroup
+		success int64
+		mu      sync.Mutex
+	)
+	for g := 0; g < 200; g++ {
+		wg.Add(1)
+		go func() {
+			defer wg.Done()
+			local := int64(0)
+			for i := 0; i < 200; i++ {
+				if r.Allow(ActionNewOrder, "shared-acc", 1000) {
+					local++
+				}
+			}
+			mu.Lock()
+			success += local
+			mu.Unlock()
+		}()
+	}
+	wg.Wait()
+	// The clock is frozen, so no refill can happen: exactly the
+	// bucket's capacity must succeed, no more and no fewer.
+	if success != 1000 {
+		t.Errorf("got %d successes, want exactly 1000 (bucket capacity, frozen clock, no refill)",
+			success)
+	}
+}