acme-server: cert-manager integration test + production hardening (Phase 5/7)

Closes the production-readiness loop on the ACME surface. After this
commit, certctl ships per-account rate limits + a GC sweeper for
expired ACME state + a kind-driven cert-manager 1.15 integration test
+ a lego-driven RFC conformance harness + a k6 loadtest scenario for
the unauthenticated ACME path.

Architecture:
  - Rate limits live in-memory + per-replica. Restart wipes the
    counters; orders/hour caps are eventual-consistency anyway. A
    3-replica certctl-server fleet behind an LB effectively has 3x
    the configured throughput per account; persistent rate limiting
    is a follow-up if production telemetry shows abuse patterns we
    can't catch in a single restart cycle. Per-key + per-action
    isolation: ActionNewOrder/acc-1, ActionKeyChange/acc-1, and
    ActionChallengeRespond/<challenge-id> are independent buckets.
  - GC loop follows the existing scheduler-loop pattern (atomic.Bool
    + sync.WaitGroup; see crlGenerationLoop for shape). Three
    independent SQL sweeps per tick (DELETE expired nonces; UPDATE
    pending authzs whose expires_at < now() to expired; UPDATE
    pending/ready/processing orders whose expires_at < now() to
    invalid). Each sweep is a single statement; failures are logged-
    and-continued so a failing nonces sweep doesn't block authzs.
    Per-sweep 1m timeout bounds a stuck Postgres.
  - cert-manager integration test is gated on KIND_AVAILABLE so CI
    skips it cleanly (kind is too heavy for per-PR). Operators run
    locally via 'make acme-cert-manager-test'; the harness brings up
    a fresh cluster each run + tears it down on Cleanup.
  - lego conformance harness drives a real ACME client through
    register → run → cert-PEM-landed against a hermetic certctl
    stack. Catches RFC-shape regressions third-party clients would
    hit before they ship.
  - k6 ACME-flow scenario hammers the unauthenticated surface
    (directory + new-nonce + ARI synthetic-id) at 100 VUs × 5m. JWS-
    signed flows are out of scope for k6 (no JWS support); they're
    covered by the lego harness above.

What ships:
  - internal/api/acme/ratelimit.go (+ ratelimit_test.go: 7 cases —
    disable-when-perHour-zero, capacity, per-key isolation, per-
    action isolation, refill-over-time, RetryAfter, concurrent-access
    with -race + 200 goroutines × 200 calls).
  - internal/repository/postgres/acme.go: 4 new methods —
    CountActiveOrdersByAccount + GCExpiredNonces + GCExpireAuthorizations
    + GCInvalidateExpiredOrders. Each a single SQL statement.
  - internal/service/acme.go: SetRateLimiter + GarbageCollect +
    rate-limit gates at 3 entry points (CreateOrder + RotateAccountKey
    + RespondToChallenge) + concurrent-orders gate at CreateOrder.
    2 new sentinels (ErrACMERateLimited, ErrACMEConcurrentOrdersExceeded);
    5 new GC metrics (gc_runs / gc_run_failures / gc_nonces_reaped /
    gc_authzs_expired / gc_orders_invalidated).
  - internal/scheduler/scheduler.go: ACMEGarbageCollector interface +
    acmeGCRunning atomic.Bool + acmeGCInterval + 2 setters (SetACME-
    GarbageCollector + SetACMEGCInterval) + acmeGCLoop following the
    crlGenerationLoop shape.
  - internal/api/handler/acme.go: writeServiceError gains rateLimited
    (429 + RFC 8555 §6.7) + concurrent-orders-exceeded mappings.
  - internal/config/config.go: 5 new env vars
    (CERTCTL_ACME_SERVER_RATE_LIMIT_ORDERS_PER_HOUR=100,
    CERTCTL_ACME_SERVER_RATE_LIMIT_CONCURRENT_ORDERS=5,
    CERTCTL_ACME_SERVER_RATE_LIMIT_KEY_CHANGE_PER_HOUR=5,
    CERTCTL_ACME_SERVER_RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR=60,
    CERTCTL_ACME_SERVER_GC_INTERVAL=1m).
  - cmd/server/main.go: NewRateLimiter() + SetRateLimiter() at
    startup; conditional SetACMEGarbageCollector(acmeService) +
    SetACMEGCInterval(cfg.ACMEServer.GCInterval) when Enabled+
    GCInterval > 0.
  - deploy/test/acme-integration/: kind-config.yaml + cert-manager-
    install.sh + clusterissuer-trust-authenticated.yaml +
    clusterissuer-challenge.yaml + certificate-test.yaml + conformance-
    lego.sh + certmanager_test.go (//go:build integration + KIND_AVAILABLE
    gate).
  - deploy/test/loadtest/k6/acme_flow.js + README ACME-flows section.
  - Makefile: 2 new PHONY targets (acme-cert-manager-test +
    acme-rfc-conformance-test).
  - docs/acme-server.md: status flipped to Phase 5; Configuration
    table grows 5 rows; new 'Phase 5 — operational guidance' section
    explaining rate-limit math + GC sweeper semantics + cert-manager
    integration + lego conformance + k6 baseline.

Tests:
  - 'go vet ./...' clean across the repo.
  - 'go test -short -count=1 ./internal/...' green across every
    affected package (service / acme / handler / scheduler / repo /
    config).
  - 'go vet -tags=integration ./deploy/test/acme-integration/' clean
    (the integration test compiles cleanly with the build tag).
  - The kind/cert-manager harness is gated behind KIND_AVAILABLE so
    CI skips by default; operators run locally via 'make acme-cert-
    manager-test'.

Engineering history: cowork/WORKSPACE-CHANGELOG.md 'ACME-Server-5'.
This commit is contained in:
shankar0123
2026-05-03 19:42:03 +00:00
parent 9bfbac0f97
commit bee47f0318
20 changed files with 1341 additions and 21 deletions
+147
View File
@@ -56,6 +56,20 @@ type ACMERepo interface {
// Phase 4 — key rollover + revocation auth.
UpdateAccountJWKWithTx(ctx context.Context, q repository.Querier, accountID, expectedOldThumbprint, newThumbprint, newJWKPEM string) error
AccountOwnsCertificate(ctx context.Context, accountID, certificateID string) (bool, error)
// Phase 5 — per-account concurrent-order count + GC sweeps.
// CountActiveOrdersByAccount returns the number of orders in
// pending/ready/processing for the given account.
CountActiveOrdersByAccount(ctx context.Context, accountID string) (int, error)
// GCExpiredNonces deletes nonces whose expires_at < now() OR
// used = true. Returns rows-affected count for telemetry.
GCExpiredNonces(ctx context.Context) (int64, error)
// GCExpireAuthorizations transitions authzs in `pending` whose
// expires_at < now() to `expired`. Returns rows-affected count.
GCExpireAuthorizations(ctx context.Context) (int64, error)
// GCInvalidateExpiredOrders transitions orders in
// pending/ready/processing whose expires_at < now() to `invalid`
// with a server-internal error. Returns rows-affected count.
GCInvalidateExpiredOrders(ctx context.Context) (int64, error)
}
// CertificateRevoker is the minimum surface ACMEService needs to route
@@ -142,6 +156,12 @@ type ACMEService struct {
// and RenewalInfo returns the no-policy default window.
revoker CertificateRevoker
renewalPolicies RenewalPolicyLookup
// Phase 5 — per-account rate limiter. cmd/server/main.go constructs
// an *acme.RateLimiter and wires it via SetRateLimiter. When unset
// (tests, legacy bootstrap) the limiter calls short-circuit to
// "always allow" — same shape as the validatorPool unset case.
rateLimiter *acme.RateLimiter
}
// NewACMEService constructs an ACMEService with the directory + nonce
@@ -196,6 +216,16 @@ func (s *ACMEService) SetRevocationDelegate(r CertificateRevoker) { s.revoker =
// still returns 200.
func (s *ACMEService) SetRenewalPolicyLookup(r RenewalPolicyLookup) { s.renewalPolicies = r }
// SetRateLimiter wires Phase 5's per-account rate limiter. Optional —
// when nil, the per-action rate-limit checks short-circuit to
// "always allow" so the legacy code path stays unchanged for bootstrap
// + tests that don't care about throttling.
func (s *ACMEService) SetRateLimiter(r *acme.RateLimiter) { s.rateLimiter = r }
// RateLimiter returns the wired limiter so the handler can compute
// Retry-After durations on rate-limited responses without re-checking.
func (s *ACMEService) RateLimiter() *acme.RateLimiter { return s.rateLimiter }
// SetValidatorPool wires Phase 3's challenge validator pool.
// cmd/server/main.go constructs an *acme.Pool at startup with the
// per-type concurrency caps from cfg.ACMEServer. Optional —
@@ -334,6 +364,19 @@ var ErrACMEARIDisabled = errors.New("acme: ARI is disabled on this server")
// not RFC 9773 §4.1 shape. Handler maps to 400 + malformed.
var ErrACMEARIBadCertID = errors.New("acme: ARI cert-id is malformed")
// Phase 5 sentinels.
// ErrACMERateLimited is returned when the per-action rate limit fires.
// Handler maps to RFC 7807 + RFC 8555 §6.7
// `urn:ietf:params:acme:error:rateLimited` with a Retry-After header.
var ErrACMERateLimited = errors.New("acme: rate limit exceeded")
// ErrACMEConcurrentOrdersExceeded is returned by CreateOrder when the
// account already has cfg.RateLimitConcurrentOrders orders in
// pending/ready/processing. Handler maps to rateLimited (RFC 8555 §6.7
// shape; the certctl-side cause is concurrency rather than per-hour).
var ErrACMEConcurrentOrdersExceeded = errors.New("acme: concurrent orders limit exceeded")
// BuildDirectory constructs the per-profile directory document.
//
// profileID resolution:
@@ -463,6 +506,13 @@ type ACMEMetrics struct {
RevokeCertFailTotal atomic.Uint64 // rejected revocation (4xx)
RenewalInfoTotal atomic.Uint64 // ARI 200
RenewalInfoFailTotal atomic.Uint64 // ARI 4xx
// Phase 5 — GC sweep counts (per-tick rows-affected, summed).
GCNoncesReapedTotal atomic.Uint64
GCAuthzsExpiredTotal atomic.Uint64
GCOrdersInvalidatedTotal atomic.Uint64
GCRunsTotal atomic.Uint64
GCRunFailuresTotal atomic.Uint64
}
// NewACMEMetrics returns a zeroed counter table. Concurrent callers
@@ -508,6 +558,11 @@ func (m *ACMEMetrics) Snapshot() map[string]uint64 {
"certctl_acme_revoke_cert_failures_total": m.RevokeCertFailTotal.Load(),
"certctl_acme_renewal_info_total": m.RenewalInfoTotal.Load(),
"certctl_acme_renewal_info_failures_total": m.RenewalInfoFailTotal.Load(),
"certctl_acme_gc_nonces_reaped_total": m.GCNoncesReapedTotal.Load(),
"certctl_acme_gc_authzs_expired_total": m.GCAuthzsExpiredTotal.Load(),
"certctl_acme_gc_orders_invalidated_total": m.GCOrdersInvalidatedTotal.Load(),
"certctl_acme_gc_runs_total": m.GCRunsTotal.Load(),
"certctl_acme_gc_run_failures_total": m.GCRunFailuresTotal.Load(),
}
}
@@ -783,6 +838,27 @@ func (s *ACMEService) CreateOrder(
s.metrics.bump(&s.metrics.NewOrderFailureTotal)
return nil, fmt.Errorf("acme: new-order requires SetTransactor + SetAuditService")
}
// Phase 5 — per-account orders/hour cap. Hits return rateLimited
// (RFC 8555 §6.7) before any DB work. Counter is in-memory; restart
// wipes (eventual-consistency caps are acceptable).
if s.rateLimiter != nil && s.cfg.RateLimitOrdersPerHour > 0 {
if !s.rateLimiter.Allow(acme.ActionNewOrder, accountID, s.cfg.RateLimitOrdersPerHour) {
s.metrics.bump(&s.metrics.NewOrderFailureTotal)
return nil, ErrACMERateLimited
}
}
// Phase 5 — concurrent-orders cap. We count
// pending/ready/processing orders for this account; if at-or-over
// the cap, reject. This is a DB read (no FOR UPDATE), so two
// requests racing under the threshold can both succeed and push
// the account one over — accepted as eventual-consistency.
if s.cfg.RateLimitConcurrentOrders > 0 {
count, cerr := s.repo.CountActiveOrdersByAccount(ctx, accountID)
if cerr == nil && count >= s.cfg.RateLimitConcurrentOrders {
s.metrics.bump(&s.metrics.NewOrderFailureTotal)
return nil, ErrACMEConcurrentOrdersExceeded
}
}
resolvedProfileID, err := s.resolveProfile(ctx, profileID)
if err != nil {
s.metrics.bump(&s.metrics.NewOrderFailureTotal)
@@ -1285,6 +1361,16 @@ func (s *ACMEService) RespondToChallenge(
s.metrics.bump(&s.metrics.ChallengeRespondFailTotal)
return nil, ErrACMEChallengePoolUnconfigured
}
// Phase 5 — per-challenge respond rate limit. Defends against retry
// storms from a misbehaving client. Keyed by challengeID (not
// accountID) so a flood against one challenge doesn't drain the
// account's whole budget.
if s.rateLimiter != nil && s.cfg.RateLimitChallengeRespondsPerHour > 0 {
if !s.rateLimiter.Allow(acme.ActionChallengeRespond, challengeID, s.cfg.RateLimitChallengeRespondsPerHour) {
s.metrics.bump(&s.metrics.ChallengeRespondFailTotal)
return nil, ErrACMERateLimited
}
}
ch, err := s.repo.GetChallengeByID(ctx, challengeID)
if err != nil {
@@ -1508,6 +1594,14 @@ func (s *ACMEService) RotateAccountKey(
s.metrics.bump(&s.metrics.KeyChangeFailTotal)
return nil, ErrACMEKeyRolloverInvalid
}
// Phase 5 — rollovers/hour cap. Defaults to 5/hour: a flood is an
// attack signal (key rotation should be rare). Keyed by accountID.
if s.rateLimiter != nil && s.cfg.RateLimitKeyChangePerHour > 0 {
if !s.rateLimiter.Allow(acme.ActionKeyChange, oldAccount.AccountID, s.cfg.RateLimitKeyChangePerHour) {
s.metrics.bump(&s.metrics.KeyChangeFailTotal)
return nil, ErrACMERateLimited
}
}
newThumbprint, err := acme.JWKThumbprint(newJWK)
if err != nil {
@@ -1816,3 +1910,56 @@ func mapACMERevocationReason(code int) string {
return string(domain.RevocationReasonUnspecified)
}
}
// GarbageCollect runs a single ACME GC sweep. Phase 5 — the scheduler
// invokes this every cfg.GCInterval. Three independent sweeps:
//
// 1. Delete used / expired nonces.
// 2. Transition expired pending authzs to `expired`.
// 3. Transition expired pending/ready/processing orders to `invalid`.
//
// Each sweep is a single SQL statement (no per-row transactions) so a
// large reap is one atomic write per sweep. Per-sweep errors are
// logged-and-continued: a failing nonces sweep doesn't block the
// authzs sweep. Returns the first error encountered (for caller
// telemetry); per-sweep counts are recorded on metrics regardless.
//
// Idempotent — repeated runs are safe; the second run finds 0 rows.
func (s *ACMEService) GarbageCollect(ctx context.Context) error {
s.metrics.bump(&s.metrics.GCRunsTotal)
var firstErr error
if n, err := s.repo.GCExpiredNonces(ctx); err != nil {
s.metrics.bump(&s.metrics.GCRunFailuresTotal)
if firstErr == nil {
firstErr = fmt.Errorf("acme gc: nonces: %w", err)
}
} else if n > 0 {
atomicAddUint64(&s.metrics.GCNoncesReapedTotal, uint64(n))
}
if n, err := s.repo.GCExpireAuthorizations(ctx); err != nil {
s.metrics.bump(&s.metrics.GCRunFailuresTotal)
if firstErr == nil {
firstErr = fmt.Errorf("acme gc: authzs: %w", err)
}
} else if n > 0 {
atomicAddUint64(&s.metrics.GCAuthzsExpiredTotal, uint64(n))
}
if n, err := s.repo.GCInvalidateExpiredOrders(ctx); err != nil {
s.metrics.bump(&s.metrics.GCRunFailuresTotal)
if firstErr == nil {
firstErr = fmt.Errorf("acme gc: orders: %w", err)
}
} else if n > 0 {
atomicAddUint64(&s.metrics.GCOrdersInvalidatedTotal, uint64(n))
}
return firstErr
}
// atomicAddUint64 adds delta to the counter. The metrics struct exposes
// only `bump` (add 1) by default; this helper covers the
// rows-affected-N case the GC needs.
func atomicAddUint64(c *atomic.Uint64, delta uint64) { c.Add(delta) }
+17
View File
@@ -164,6 +164,23 @@ func (f *fakeACMERepo) UpdateAccountJWKWithTx(ctx context.Context, q repository.
func (f *fakeACMERepo) AccountOwnsCertificate(ctx context.Context, accountID, certificateID string) (bool, error) {
return false, nil
}
func (f *fakeACMERepo) CountActiveOrdersByAccount(ctx context.Context, accountID string) (int, error) {
return 0, nil
}
func (f *fakeACMERepo) GCExpiredNonces(ctx context.Context) (int64, error) {
n := int64(0)
for nonce, exp := range f.issued {
if time.Now().After(exp) {
delete(f.issued, nonce)
n++
}
}
return n, nil
}
func (f *fakeACMERepo) GCExpireAuthorizations(ctx context.Context) (int64, error) { return 0, nil }
func (f *fakeACMERepo) GCInvalidateExpiredOrders(ctx context.Context) (int64, error) {
return 0, nil
}
// fakeTransactor is the repository.Transactor stand-in: runs fn
// against the supplied querier (we just pass nil — fakes ignore it).