acme-server: cert-manager integration test + production hardening (Phase 5/7)

Closes the production-readiness loop on the ACME surface. After this
commit, certctl ships per-account rate limits + a GC sweeper for
expired ACME state + a kind-driven cert-manager 1.15 integration test
+ a lego-driven RFC conformance harness + a k6 loadtest scenario for
the unauthenticated ACME path.

Architecture:

- Rate limits live in-memory + per-replica. Restart wipes the
  counters; orders/hour caps are eventual-consistency anyway. A
  3-replica certctl-server fleet behind an LB effectively has 3x the
  configured throughput per account; persistent rate limiting is a
  follow-up if production telemetry shows abuse patterns we can't
  catch in a single restart cycle. Per-key + per-action isolation:
  ActionNewOrder/acc-1, ActionKeyChange/acc-1, and
  ActionChallengeRespond/<challenge-id> are independent buckets.
- GC loop follows the existing scheduler-loop pattern (atomic.Bool +
  sync.WaitGroup; see crlGenerationLoop for shape). Three independent
  SQL sweeps per tick (DELETE expired nonces; UPDATE pending authzs
  whose expires_at < now() to expired; UPDATE pending/ready/processing
  orders whose expires_at < now() to invalid). Each sweep is a single
  statement; failures are logged-and-continued so a failing nonces
  sweep doesn't block authzs. Per-sweep 1m timeout bounds a stuck
  Postgres.
- cert-manager integration test is gated on KIND_AVAILABLE so CI skips
  it cleanly (kind is too heavy for per-PR). Operators run locally via
  'make acme-cert-manager-test'; the harness brings up a fresh cluster
  each run + tears it down on Cleanup.
- lego conformance harness drives a real ACME client through
  register → run → cert-PEM-landed against a hermetic certctl stack.
  Catches RFC-shape regressions third-party clients would hit before
  they ship.
- k6 ACME-flow scenario hammers the unauthenticated surface
  (directory + new-nonce + ARI synthetic-id) at 100 VUs × 5m.
  JWS-signed flows are out of scope for k6 (no JWS support); they're
  covered by the lego harness above.

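To make the per-key + per-action bucket isolation described in the architecture notes concrete, here is a minimal token-bucket sketch. It is not the contents of ratelimit.go: the names `limiter`, `newLimiter`, `Allow`, and `fakeClock` are hypothetical, and the real limiter's refill strategy and signatures may differ.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// bucketKey isolates buckets by action and by account or challenge key.
type bucketKey struct{ action, key string }

// bucket holds the remaining tokens for one (action, key) pair.
type bucket struct {
	tokens int
	last   time.Time // start of the current refill interval
}

// limiter is a hypothetical in-memory, per-replica token-bucket limiter.
// State lives only in this map, so a restart resets every counter.
type limiter struct {
	mu      sync.Mutex
	now     func() time.Time
	buckets map[bucketKey]*bucket
}

func newLimiter(now func() time.Time) *limiter {
	return &limiter{now: now, buckets: make(map[bucketKey]*bucket)}
}

// Allow takes one token from the (action, key) bucket. perHour <= 0
// disables limiting. On denial it also returns the wait until the next
// token, which a handler could surface as Retry-After.
func (l *limiter) Allow(action, key string, perHour int) (bool, time.Duration) {
	if perHour <= 0 {
		return true, 0
	}
	l.mu.Lock()
	defer l.mu.Unlock()

	now := l.now()
	id := bucketKey{action, key}
	b, ok := l.buckets[id]
	if !ok {
		b = &bucket{tokens: perHour, last: now}
		l.buckets[id] = b
	}
	// One token is added per interval (assumes perHour is small enough
	// that the interval is at least one nanosecond).
	interval := time.Hour / time.Duration(perHour)
	if b.tokens >= perHour {
		b.last = now // full bucket: the refill clock starts now
	} else if n := int64(now.Sub(b.last) / interval); n > 0 {
		if n >= int64(perHour-b.tokens) {
			b.tokens, b.last = perHour, now
		} else {
			b.tokens += int(n)
			b.last = b.last.Add(time.Duration(n) * interval)
		}
	}
	if b.tokens == 0 {
		return false, interval - now.Sub(b.last)
	}
	b.tokens--
	return true, 0
}

// fakeClock is a manually advanced clock that keeps the demo deterministic.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time          { return c.t }
func (c *fakeClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

func main() {
	clock := &fakeClock{}
	l := newLimiter(clock.Now)
	for i := 0; i < 3; i++ {
		ok, wait := l.Allow("new-order", "acc-1", 2)
		fmt.Println("new-order/acc-1:", ok, wait)
	}
	ok, _ := l.Allow("key-change", "acc-1", 2)
	fmt.Println("key-change/acc-1:", ok)
	clock.Advance(30 * time.Minute)
	ok, _ = l.Allow("new-order", "acc-1", 2)
	fmt.Println("new-order/acc-1 after 30m:", ok)
}
```

Because each replica would hold its own map, three replicas behind a load balancer each grant the full per-hour budget, which is the 3x effect noted above.
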
What ships:

- internal/api/acme/ratelimit.go (+ ratelimit_test.go, 7 cases:
  disable-when-perHour-zero, capacity, per-key isolation, per-action
  isolation, refill-over-time, RetryAfter, concurrent-access with
  -race + 200 goroutines × 200 calls).
- internal/repository/postgres/acme.go: 4 new methods:
  CountActiveOrdersByAccount + GCExpiredNonces +
  GCExpireAuthorizations + GCInvalidateExpiredOrders. Each a single
  SQL statement.
- internal/service/acme.go: SetRateLimiter + GarbageCollect +
  rate-limit gates at 3 entry points (CreateOrder + RotateAccountKey
  + RespondToChallenge) + concurrent-orders gate at CreateOrder. 2 new
  sentinels (ErrACMERateLimited, ErrACMEConcurrentOrdersExceeded);
  5 new GC metrics (gc_runs / gc_run_failures / gc_nonces_reaped /
  gc_authzs_expired / gc_orders_invalidated).
- internal/scheduler/scheduler.go: ACMEGarbageCollector interface +
  acmeGCRunning atomic.Bool + acmeGCInterval + 2 setters
  (SetACMEGarbageCollector + SetACMEGCInterval) + acmeGCLoop following
  the crlGenerationLoop shape.
- internal/api/handler/acme.go: writeServiceError gains rateLimited
  (429 + RFC 8555 §6.7) + concurrent-orders-exceeded mappings.
- internal/config/config.go: 5 new env vars:
    CERTCTL_ACME_SERVER_RATE_LIMIT_ORDERS_PER_HOUR=100
    CERTCTL_ACME_SERVER_RATE_LIMIT_CONCURRENT_ORDERS=5
    CERTCTL_ACME_SERVER_RATE_LIMIT_KEY_CHANGE_PER_HOUR=5
    CERTCTL_ACME_SERVER_RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR=60
    CERTCTL_ACME_SERVER_GC_INTERVAL=1m
- cmd/server/main.go: NewRateLimiter() + SetRateLimiter() at startup;
  conditional SetACMEGarbageCollector(acmeService) +
  SetACMEGCInterval(cfg.ACMEServer.GCInterval) when Enabled +
  GCInterval > 0.
- deploy/test/acme-integration/: kind-config.yaml +
  cert-manager-install.sh + clusterissuer-trust-authenticated.yaml +
  clusterissuer-challenge.yaml + certificate-test.yaml +
  conformance-lego.sh + certmanager_test.go (//go:build integration +
  KIND_AVAILABLE gate).
- deploy/test/loadtest/k6/acme_flow.js + README ACME-flows section.
- Makefile: 2 new PHONY targets (acme-cert-manager-test +
  acme-rfc-conformance-test).
- docs/acme-server.md: status flipped to Phase 5; Configuration table
  grows 5 rows; new 'Phase 5 — operational guidance' section
  explaining rate-limit math + GC sweeper semantics + cert-manager
  integration + lego conformance + k6 baseline.

Tests:

- 'go vet ./...' clean across the repo.
- 'go test -short -count=1 ./internal/...' green across every affected
  package (service / acme / handler / scheduler / repo / config).
- 'go vet -tags=integration ./deploy/test/acme-integration/' clean
  (the integration test compiles cleanly with the build tag).
- The kind/cert-manager harness is gated behind KIND_AVAILABLE so CI
  skips by default; operators run locally via
  'make acme-cert-manager-test'.

Engineering history: cowork/WORKSPACE-CHANGELOG.md 'ACME-Server-5'.
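
The rateLimited mapping listed for internal/api/handler/acme.go could look roughly like the sketch below. RFC 8555 defines the `urn:ietf:params:acme:error:rateLimited` error type and recommends a Retry-After header; everything else here (the `problem` struct, `writeRateLimited`, `render`, the detail text) is hypothetical and not taken from writeServiceError.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"time"
)

// problem is the RFC 7807 problem document that ACME uses for errors.
type problem struct {
	Type   string `json:"type"`
	Detail string `json:"detail"`
	Status int    `json:"status"`
}

// writeRateLimited is a hypothetical helper that answers a rate-limited
// request with 429 plus an ACME rateLimited problem document.
func writeRateLimited(w http.ResponseWriter, retryAfter time.Duration) {
	// Retry-After carries whole seconds, so round up and never send 0.
	secs := int((retryAfter + time.Second - 1) / time.Second)
	if secs < 1 {
		secs = 1
	}
	w.Header().Set("Content-Type", "application/problem+json")
	w.Header().Set("Retry-After", strconv.Itoa(secs))
	w.WriteHeader(http.StatusTooManyRequests)
	_ = json.NewEncoder(w).Encode(problem{
		Type:   "urn:ietf:params:acme:error:rateLimited",
		Detail: "rate limit exceeded for this account", // illustrative text
		Status: http.StatusTooManyRequests,
	})
}

// render runs writeRateLimited against an in-memory recorder.
func render(retryAfter time.Duration) *httptest.ResponseRecorder {
	rec := httptest.NewRecorder()
	writeRateLimited(rec, retryAfter)
	return rec
}

func main() {
	rec := render(90 * time.Second)
	fmt.Println(rec.Code, rec.Header().Get("Retry-After"))
	fmt.Print(rec.Body.String())
}
```

The concurrent-orders-exceeded mapping is left out because the commit message does not say which status or error type it uses.
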
2026-06-07 13:51:36 +00:00 · 2026-05-03 19:42:03 +00:00
parent 9bfbac0f97
commit bee47f0318
20 changed files with 1341 additions and 21 deletions
@@ -77,6 +77,13 @@ type CRLCacheServicer interface {
 	RegenerateAll(ctx context.Context)
 }

+// ACMEGarbageCollector is the interface the scheduler's acmeGCLoop
+// invokes once per tick. The concrete implementation is *service.ACMEService.
+// Phase 5 — sweeps expired nonces / authzs / orders.
+type ACMEGarbageCollector interface {
+	GarbageCollect(ctx context.Context) error
+}
+
 // JobReaperService defines the interface for job timeout reaping used by the scheduler.
 type JobReaperService interface {
 	ReapTimedOutJobs(ctx context.Context, csrTTL, approvalTTL time.Duration) error
@@ -101,6 +108,7 @@ type Scheduler struct {
 	healthCheckService    HealthCheckServicer
 	cloudDiscoveryService CloudDiscoveryServicer
 	crlCacheService       CRLCacheServicer
+	acmeGC                ACMEGarbageCollector
 	jobReaper             JobReaperService
 	logger                *slog.Logger

@@ -118,6 +126,7 @@ type Scheduler struct {
 	cloudDiscoveryInterval        time.Duration
 	crlGenerationInterval         time.Duration
 	jobTimeoutInterval            time.Duration
+	acmeGCInterval                time.Duration
 	// agentOfflineJobTTL: per-tick threshold for reaping Running jobs whose
 	// owning agent has been silent. Bundle C / Audit M-016. Defaults below.
 	agentOfflineJobTTL      time.Duration
@@ -138,6 +147,7 @@ type Scheduler struct {
 	cloudDiscoveryRunning        atomic.Bool
 	crlGenerationRunning         atomic.Bool
 	jobTimeoutRunning            atomic.Bool
+	acmeGCRunning                atomic.Bool

 	// Graceful shutdown: wait for in-flight work to complete
 	wg sync.WaitGroup
@@ -174,6 +184,7 @@ func NewScheduler(
 		cloudDiscoveryInterval:        6 * time.Hour,
 		crlGenerationInterval:         1 * time.Hour,
 		jobTimeoutInterval:            10 * time.Minute,
+		acmeGCInterval:                1 * time.Minute,
 		// 5 minutes is 5×agentHealthCheckInterval default of 1m; an agent
 		// must miss multiple heartbeats before its in-flight jobs are reaped.
 		agentOfflineJobTTL: 5 * time.Minute,
@@ -287,6 +298,25 @@ func (s *Scheduler) SetJobReaperService(jr JobReaperService) {
 	s.jobReaper = jr
 }

+// SetACMEGarbageCollector wires the ACME GC service. Phase 5 — when
+// non-nil, an acmeGCLoop runs every acmeGCInterval and sweeps expired
+// nonces / authzs / orders. Optional: leaving nil disables the loop
+// (legacy behavior pre-Phase-5).
+func (s *Scheduler) SetACMEGarbageCollector(gc ACMEGarbageCollector) {
+	s.acmeGC = gc
+}
+
+// SetACMEGCInterval configures the interval at which the ACME GC sweep
+// runs. Default 1m. Operators with quiet fleets can lengthen to 5m;
+// operators expecting nonce-storms can shorten to 30s. Zero or
+// negative values are ignored.
+func (s *Scheduler) SetACMEGCInterval(d time.Duration) {
+	if d <= 0 {
+		return
+	}
+	s.acmeGCInterval = d
+}
+
 // SetAgentOfflineJobTTL sets the threshold past which a Running job whose
 // owning agent has gone silent is reaped to Failed. Bundle C / Audit M-016.
 // Zero or negative values are ignored (the default of 5 minutes is kept).
@@ -342,6 +372,9 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
 		if s.crlCacheService != nil {
 			loopCount++
 		}
+		if s.acmeGC != nil {
+			loopCount++
+		}
 		s.wg.Add(loopCount)

 		go func() { defer s.wg.Done(); s.renewalCheckLoop(ctx) }()
@@ -367,6 +400,9 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
 		if s.crlCacheService != nil {
 			go func() { defer s.wg.Done(); s.crlGenerationLoop(ctx) }()
 		}
+		if s.acmeGC != nil {
+			go func() { defer s.wg.Done(); s.acmeGCLoop(ctx) }()
+		}

 		// Signal that all loops are launched
 		close(startedChan)
@@ -1074,3 +1110,39 @@ func (s *Scheduler) runCRLGeneration(ctx context.Context) {

 // ErrSchedulerShutdownTimeout is returned when scheduler graceful shutdown times out.
 var ErrSchedulerShutdownTimeout = errors.New("scheduler graceful shutdown timeout")
+
+// acmeGCLoop runs every acmeGCInterval and invokes ACMEGarbageCollector.
+// Per CLAUDE.md "Scheduler idempotency" architecture decision: an
+// atomic.Bool guard prevents concurrent tick execution; the
+// sync.WaitGroup tracks the in-flight goroutine for graceful shutdown.
+// Phase 5.
+func (s *Scheduler) acmeGCLoop(ctx context.Context) {
+	ticker := time.NewTicker(s.acmeGCInterval)
+	defer ticker.Stop()
+
+	for {
+		select {
+		case <-ctx.Done():
+			return
+		case <-ticker.C:
+			if !s.acmeGCRunning.CompareAndSwap(false, true) {
+				s.logger.Warn("ACME GC sweep still running, skipping tick")
+				continue
+			}
+			s.wg.Add(1)
+			go func() {
+				defer s.wg.Done()
+				defer s.acmeGCRunning.Store(false)
+				// 1-minute timeout per sweep — the per-statement work is
+				// cheap (single DELETE / UPDATE per sweep, all on indexed
+				// columns), but bound the cycle so a stuck Postgres can't
+				// block the next tick.
+				opCtx, cancel := context.WithTimeout(ctx, time.Minute)
+				defer cancel()
+				if err := s.acmeGC.GarbageCollect(opCtx); err != nil {
+					s.logger.Warn("acme gc sweep failed (next tick will retry)", "error", err)
+				}
+			}()
+		}
+	}
+}
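
The loop above only sees a single error from GarbageCollect. As a minimal sketch of how an implementation could satisfy the ACMEGarbageCollector interface while keeping the three sweeps independent (log and continue, as the commit message describes), assuming hypothetical names `collector`, `sweep`, and `runDemo` and plain functions in place of the real SQL statements:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
)

// sweep is one named GC step. In the commit each step is a single SQL
// statement; here it is a plain function so the sketch runs standalone.
type sweep struct {
	name string
	run  func(ctx context.Context) (int64, error)
}

// collector is a hypothetical stand-in for *service.ACMEService.
type collector struct {
	sweeps []sweep
}

// GarbageCollect runs every sweep even when an earlier one fails, logs
// each failure, and returns the failures joined into one error (nil if
// all sweeps succeeded). Returning the joined error is an assumption.
func (c *collector) GarbageCollect(ctx context.Context) error {
	var errs []error
	for _, s := range c.sweeps {
		n, err := s.run(ctx)
		if err != nil {
			log.Printf("acme gc: %s sweep failed: %v", s.name, err)
			errs = append(errs, fmt.Errorf("%s: %w", s.name, err))
			continue
		}
		log.Printf("acme gc: %s sweep affected %d rows", s.name, n)
	}
	return errors.Join(errs...)
}

// runDemo wires a failing nonces sweep ahead of two healthy sweeps and
// reports which sweeps still ran.
func runDemo() ([]string, error) {
	var ran []string
	healthy := func(name string) sweep {
		return sweep{name: name, run: func(context.Context) (int64, error) {
			ran = append(ran, name)
			return 0, nil
		}}
	}
	c := &collector{sweeps: []sweep{
		{name: "nonces", run: func(context.Context) (int64, error) {
			return 0, errors.New("connection refused")
		}},
		healthy("authzs"),
		healthy("orders"),
	}}
	err := c.GarbageCollect(context.Background())
	return ran, err
}

func main() {
	ran, err := runDemo()
	fmt.Println("sweeps that still ran:", ran)
	fmt.Println("returned error:", err)
}
```

With this shape the scheduler's warning fires once per tick for any failed sweep, while the authz and order sweeps still run when the nonce sweep cannot reach the database.
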