mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 13:51:36 +00:00
acme-server: cert-manager integration test + production hardening (Phase 5/7)
Closes the production-readiness loop on the ACME surface. After this
commit, certctl ships per-account rate limits + a GC sweeper for
expired ACME state + a kind-driven cert-manager 1.15 integration test
+ a lego-driven RFC conformance harness + a k6 loadtest scenario for
the unauthenticated ACME path.
Architecture:
- Rate limits live in-memory + per-replica. Restart wipes the
counters; orders/hour caps are eventual-consistency anyway. A
3-replica certctl-server fleet behind an LB effectively has 3x
the configured throughput per account; persistent rate limiting
is a follow-up if production telemetry shows abuse patterns we
can't catch in a single restart cycle. Per-key + per-action
isolation: ActionNewOrder/acc-1, ActionKeyChange/acc-1, and
ActionChallengeRespond/<challenge-id> are independent buckets.
- GC loop follows the existing scheduler-loop pattern (atomic.Bool
+ sync.WaitGroup; see crlGenerationLoop for shape). Three
independent SQL sweeps per tick (DELETE expired nonces; UPDATE
pending authzs whose expires_at < now() to expired; UPDATE
pending/ready/processing orders whose expires_at < now() to
invalid). Each sweep is a single statement; failures are logged-
and-continued so a failing nonces sweep doesn't block authzs.
Per-sweep 1m timeout bounds a stuck Postgres.
- cert-manager integration test is gated on KIND_AVAILABLE so CI
skips it cleanly (kind is too heavy for per-PR). Operators run
locally via 'make acme-cert-manager-test'; the harness brings up
a fresh cluster each run + tears it down on Cleanup.
- lego conformance harness drives a real ACME client through
register → run → cert-PEM-landed against a hermetic certctl
stack. Catches RFC-shape regressions third-party clients would
hit before they ship.
- k6 ACME-flow scenario hammers the unauthenticated surface
(directory + new-nonce + ARI synthetic-id) at 100 VUs × 5m. JWS-
signed flows are out of scope for k6 (no JWS support); they're
covered by the lego harness above.
What ships:
- internal/api/acme/ratelimit.go (+ ratelimit_test.go: 7 cases —
disable-when-perHour-zero, capacity, per-key isolation, per-
action isolation, refill-over-time, RetryAfter, concurrent-access
with -race + 200 goroutines × 200 calls).
- internal/repository/postgres/acme.go: 4 new methods —
CountActiveOrdersByAccount + GCExpiredNonces + GCExpireAuthorizations
+ GCInvalidateExpiredOrders. Each a single SQL statement.
- internal/service/acme.go: SetRateLimiter + GarbageCollect +
rate-limit gates at 3 entry points (CreateOrder + RotateAccountKey
+ RespondToChallenge) + concurrent-orders gate at CreateOrder.
2 new sentinels (ErrACMERateLimited, ErrACMEConcurrentOrdersExceeded);
5 new GC metrics (gc_runs / gc_run_failures / gc_nonces_reaped /
gc_authzs_expired / gc_orders_invalidated).
- internal/scheduler/scheduler.go: ACMEGarbageCollector interface +
acmeGCRunning atomic.Bool + acmeGCInterval + 2 setters (SetACME-
GarbageCollector + SetACMEGCInterval) + acmeGCLoop following the
crlGenerationLoop shape.
- internal/api/handler/acme.go: writeServiceError gains rateLimited
(429 + RFC 8555 §6.7) + concurrent-orders-exceeded mappings.
- internal/config/config.go: 5 new env vars
(CERTCTL_ACME_SERVER_RATE_LIMIT_ORDERS_PER_HOUR=100,
CERTCTL_ACME_SERVER_RATE_LIMIT_CONCURRENT_ORDERS=5,
CERTCTL_ACME_SERVER_RATE_LIMIT_KEY_CHANGE_PER_HOUR=5,
CERTCTL_ACME_SERVER_RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR=60,
CERTCTL_ACME_SERVER_GC_INTERVAL=1m).
- cmd/server/main.go: NewRateLimiter() + SetRateLimiter() at
startup; conditional SetACMEGarbageCollector(acmeService) +
SetACMEGCInterval(cfg.ACMEServer.GCInterval) when Enabled+
GCInterval > 0.
- deploy/test/acme-integration/: kind-config.yaml + cert-manager-
install.sh + clusterissuer-trust-authenticated.yaml +
clusterissuer-challenge.yaml + certificate-test.yaml + conformance-
lego.sh + certmanager_test.go (//go:build integration + KIND_AVAILABLE
gate).
- deploy/test/loadtest/k6/acme_flow.js + README ACME-flows section.
- Makefile: 2 new PHONY targets (acme-cert-manager-test +
acme-rfc-conformance-test).
- docs/acme-server.md: status flipped to Phase 5; Configuration
table grows 5 rows; new 'Phase 5 — operational guidance' section
explaining rate-limit math + GC sweeper semantics + cert-manager
integration + lego conformance + k6 baseline.
Tests:
- 'go vet ./...' clean across the repo.
- 'go test -short -count=1 ./internal/...' green across every
affected package (service / acme / handler / scheduler / repo /
config).
- 'go vet -tags=integration ./deploy/test/acme-integration/' clean
(the integration test compiles cleanly with the build tag).
- The kind/cert-manager harness is gated behind KIND_AVAILABLE so
CI skips by default; operators run locally via 'make acme-cert-
manager-test'.
Engineering history: cowork/WORKSPACE-CHANGELOG.md 'ACME-Server-5'.
This commit is contained in:
@@ -77,6 +77,13 @@ type CRLCacheServicer interface {
|
||||
RegenerateAll(ctx context.Context)
|
||||
}
|
||||
|
||||
// ACMEGarbageCollector is the interface the scheduler's acmeGCLoop
|
||||
// invokes once per tick. The concrete implementation is *service.ACMEService.
|
||||
// Phase 5 — sweeps expired nonces / authzs / orders.
|
||||
type ACMEGarbageCollector interface {
|
||||
GarbageCollect(ctx context.Context) error
|
||||
}
|
||||
|
||||
// JobReaperService defines the interface for job timeout reaping used by the scheduler.
|
||||
type JobReaperService interface {
|
||||
ReapTimedOutJobs(ctx context.Context, csrTTL, approvalTTL time.Duration) error
|
||||
@@ -101,6 +108,7 @@ type Scheduler struct {
|
||||
healthCheckService HealthCheckServicer
|
||||
cloudDiscoveryService CloudDiscoveryServicer
|
||||
crlCacheService CRLCacheServicer
|
||||
acmeGC ACMEGarbageCollector
|
||||
jobReaper JobReaperService
|
||||
logger *slog.Logger
|
||||
|
||||
@@ -118,6 +126,7 @@ type Scheduler struct {
|
||||
cloudDiscoveryInterval time.Duration
|
||||
crlGenerationInterval time.Duration
|
||||
jobTimeoutInterval time.Duration
|
||||
acmeGCInterval time.Duration
|
||||
// agentOfflineJobTTL: per-tick threshold for reaping Running jobs whose
|
||||
// owning agent has been silent. Bundle C / Audit M-016. Defaults below.
|
||||
agentOfflineJobTTL time.Duration
|
||||
@@ -138,6 +147,7 @@ type Scheduler struct {
|
||||
cloudDiscoveryRunning atomic.Bool
|
||||
crlGenerationRunning atomic.Bool
|
||||
jobTimeoutRunning atomic.Bool
|
||||
acmeGCRunning atomic.Bool
|
||||
|
||||
// Graceful shutdown: wait for in-flight work to complete
|
||||
wg sync.WaitGroup
|
||||
@@ -174,6 +184,7 @@ func NewScheduler(
|
||||
cloudDiscoveryInterval: 6 * time.Hour,
|
||||
crlGenerationInterval: 1 * time.Hour,
|
||||
jobTimeoutInterval: 10 * time.Minute,
|
||||
acmeGCInterval: 1 * time.Minute,
|
||||
// 5 minutes is 5×agentHealthCheckInterval default of 1m; an agent
|
||||
// must miss multiple heartbeats before its in-flight jobs are reaped.
|
||||
agentOfflineJobTTL: 5 * time.Minute,
|
||||
@@ -287,6 +298,25 @@ func (s *Scheduler) SetJobReaperService(jr JobReaperService) {
|
||||
s.jobReaper = jr
|
||||
}
|
||||
|
||||
// SetACMEGarbageCollector wires the ACME GC service. Phase 5 — when
|
||||
// non-nil, an acmeGCLoop runs every acmeGCInterval and sweeps expired
|
||||
// nonces / authzs / orders. Optional: leaving nil disables the loop
|
||||
// (legacy behavior pre-Phase-5).
|
||||
func (s *Scheduler) SetACMEGarbageCollector(gc ACMEGarbageCollector) {
|
||||
s.acmeGC = gc
|
||||
}
|
||||
|
||||
// SetACMEGCInterval configures the interval at which the ACME GC sweep
|
||||
// runs. Default 1m. Operators with quiet fleets can lengthen to 5m;
|
||||
// operators expecting nonce-storms can shorten to 30s. Zero or
|
||||
// negative values are ignored.
|
||||
func (s *Scheduler) SetACMEGCInterval(d time.Duration) {
|
||||
if d <= 0 {
|
||||
return
|
||||
}
|
||||
s.acmeGCInterval = d
|
||||
}
|
||||
|
||||
// SetAgentOfflineJobTTL sets the threshold past which a Running job whose
|
||||
// owning agent has gone silent is reaped to Failed. Bundle C / Audit M-016.
|
||||
// Zero or negative values are ignored (the default of 5 minutes is kept).
|
||||
@@ -342,6 +372,9 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
|
||||
if s.crlCacheService != nil {
|
||||
loopCount++
|
||||
}
|
||||
if s.acmeGC != nil {
|
||||
loopCount++
|
||||
}
|
||||
s.wg.Add(loopCount)
|
||||
|
||||
go func() { defer s.wg.Done(); s.renewalCheckLoop(ctx) }()
|
||||
@@ -367,6 +400,9 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
|
||||
if s.crlCacheService != nil {
|
||||
go func() { defer s.wg.Done(); s.crlGenerationLoop(ctx) }()
|
||||
}
|
||||
if s.acmeGC != nil {
|
||||
go func() { defer s.wg.Done(); s.acmeGCLoop(ctx) }()
|
||||
}
|
||||
|
||||
// Signal that all loops are launched
|
||||
close(startedChan)
|
||||
@@ -1074,3 +1110,39 @@ func (s *Scheduler) runCRLGeneration(ctx context.Context) {
|
||||
|
||||
// ErrSchedulerShutdownTimeout is returned when scheduler graceful shutdown times out.
|
||||
var ErrSchedulerShutdownTimeout = errors.New("scheduler graceful shutdown timeout")
|
||||
|
||||
// acmeGCLoop runs every acmeGCInterval and invokes ACMEGarbageCollector.
|
||||
// Per CLAUDE.md "Scheduler idempotency" architecture decision: an
|
||||
// atomic.Bool guard prevents concurrent tick execution; the
|
||||
// sync.WaitGroup tracks the in-flight goroutine for graceful shutdown.
|
||||
// Phase 5.
|
||||
func (s *Scheduler) acmeGCLoop(ctx context.Context) {
|
||||
ticker := time.NewTicker(s.acmeGCInterval)
|
||||
defer ticker.Stop()
|
||||
|
||||
for {
|
||||
select {
|
||||
case <-ctx.Done():
|
||||
return
|
||||
case <-ticker.C:
|
||||
if !s.acmeGCRunning.CompareAndSwap(false, true) {
|
||||
s.logger.Warn("ACME GC sweep still running, skipping tick")
|
||||
continue
|
||||
}
|
||||
s.wg.Add(1)
|
||||
go func() {
|
||||
defer s.wg.Done()
|
||||
defer s.acmeGCRunning.Store(false)
|
||||
// 1-minute timeout per sweep — the per-statement work is
|
||||
// cheap (single DELETE / UPDATE per sweep, all on indexed
|
||||
// columns), but bound the cycle so a stuck Postgres can't
|
||||
// block the next tick.
|
||||
opCtx, cancel := context.WithTimeout(ctx, time.Minute)
|
||||
defer cancel()
|
||||
if err := s.acmeGC.GarbageCollect(opCtx); err != nil {
|
||||
s.logger.Warn("acme gc sweep failed (next tick will retry)", "error", err)
|
||||
}
|
||||
}()
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
Reference in New Issue
Block a user