mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 13:51:36 +00:00
acme-server: cert-manager integration test + production hardening (Phase 5/7)
Closes the production-readiness loop on the ACME surface. After this
commit, certctl ships per-account rate limits + a GC sweeper for
expired ACME state + a kind-driven cert-manager 1.15 integration test
+ a lego-driven RFC conformance harness + a k6 loadtest scenario for
the unauthenticated ACME path.
Architecture:
- Rate limits live in-memory + per-replica. Restart wipes the
counters; orders/hour caps are eventual-consistency anyway. A
3-replica certctl-server fleet behind an LB effectively has 3x
the configured throughput per account; persistent rate limiting
is a follow-up if production telemetry shows abuse patterns we
can't catch in a single restart cycle. Per-key + per-action
isolation: ActionNewOrder/acc-1, ActionKeyChange/acc-1, and
ActionChallengeRespond/<challenge-id> are independent buckets.
- GC loop follows the existing scheduler-loop pattern (atomic.Bool
+ sync.WaitGroup; see crlGenerationLoop for shape). Three
independent SQL sweeps per tick (DELETE expired nonces; UPDATE
pending authzs whose expires_at < now() to expired; UPDATE
pending/ready/processing orders whose expires_at < now() to
invalid). Each sweep is a single statement; failures are logged-
and-continued so a failing nonces sweep doesn't block authzs.
Per-sweep 1m timeout bounds a stuck Postgres.
- cert-manager integration test is gated on KIND_AVAILABLE so CI
skips it cleanly (kind is too heavy for per-PR). Operators run
locally via 'make acme-cert-manager-test'; the harness brings up
a fresh cluster each run + tears it down on Cleanup.
- lego conformance harness drives a real ACME client through
register → run → cert-PEM-landed against a hermetic certctl
stack. Catches RFC-shape regressions third-party clients would
hit before they ship.
- k6 ACME-flow scenario hammers the unauthenticated surface
(directory + new-nonce + ARI synthetic-id) at 100 VUs × 5m. JWS-
signed flows are out of scope for k6 (no JWS support); they're
covered by the lego harness above.
What ships:
- internal/api/acme/ratelimit.go (+ ratelimit_test.go: 7 cases —
disable-when-perHour-zero, capacity, per-key isolation, per-
action isolation, refill-over-time, RetryAfter, concurrent-access
with -race + 200 goroutines × 200 calls).
- internal/repository/postgres/acme.go: 4 new methods —
CountActiveOrdersByAccount + GCExpiredNonces + GCExpireAuthorizations
+ GCInvalidateExpiredOrders. Each a single SQL statement.
- internal/service/acme.go: SetRateLimiter + GarbageCollect +
rate-limit gates at 3 entry points (CreateOrder + RotateAccountKey
+ RespondToChallenge) + concurrent-orders gate at CreateOrder.
2 new sentinels (ErrACMERateLimited, ErrACMEConcurrentOrdersExceeded);
5 new GC metrics (gc_runs / gc_run_failures / gc_nonces_reaped /
gc_authzs_expired / gc_orders_invalidated).
- internal/scheduler/scheduler.go: ACMEGarbageCollector interface +
acmeGCRunning atomic.Bool + acmeGCInterval + 2 setters (SetACME-
GarbageCollector + SetACMEGCInterval) + acmeGCLoop following the
crlGenerationLoop shape.
- internal/api/handler/acme.go: writeServiceError gains rateLimited
(429 + RFC 8555 §6.7) + concurrent-orders-exceeded mappings.
- internal/config/config.go: 5 new env vars
(CERTCTL_ACME_SERVER_RATE_LIMIT_ORDERS_PER_HOUR=100,
CERTCTL_ACME_SERVER_RATE_LIMIT_CONCURRENT_ORDERS=5,
CERTCTL_ACME_SERVER_RATE_LIMIT_KEY_CHANGE_PER_HOUR=5,
CERTCTL_ACME_SERVER_RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR=60,
CERTCTL_ACME_SERVER_GC_INTERVAL=1m).
- cmd/server/main.go: NewRateLimiter() + SetRateLimiter() at
startup; conditional SetACMEGarbageCollector(acmeService) +
SetACMEGCInterval(cfg.ACMEServer.GCInterval) when Enabled+
GCInterval > 0.
- deploy/test/acme-integration/: kind-config.yaml + cert-manager-
install.sh + clusterissuer-trust-authenticated.yaml +
clusterissuer-challenge.yaml + certificate-test.yaml + conformance-
lego.sh + certmanager_test.go (//go:build integration + KIND_AVAILABLE
gate).
- deploy/test/loadtest/k6/acme_flow.js + README ACME-flows section.
- Makefile: 2 new PHONY targets (acme-cert-manager-test +
acme-rfc-conformance-test).
- docs/acme-server.md: status flipped to Phase 5; Configuration
table grows 5 rows; new 'Phase 5 — operational guidance' section
explaining rate-limit math + GC sweeper semantics + cert-manager
integration + lego conformance + k6 baseline.
Tests:
- 'go vet ./...' clean across the repo.
- 'go test -short -count=1 ./internal/...' green across every
affected package (service / acme / handler / scheduler / repo /
config).
- 'go vet -tags=integration ./deploy/test/acme-integration/' clean
(the integration test compiles cleanly with the build tag).
- The kind/cert-manager harness is gated behind KIND_AVAILABLE so
CI skips by default; operators run locally via 'make acme-cert-
manager-test'.
Engineering history: cowork/WORKSPACE-CHANGELOG.md 'ACME-Server-5'.
This commit is contained in:
@@ -56,6 +56,20 @@ type ACMERepo interface {
|
||||
// Phase 4 — key rollover + revocation auth.
|
||||
UpdateAccountJWKWithTx(ctx context.Context, q repository.Querier, accountID, expectedOldThumbprint, newThumbprint, newJWKPEM string) error
|
||||
AccountOwnsCertificate(ctx context.Context, accountID, certificateID string) (bool, error)
|
||||
// Phase 5 — per-account concurrent-order count + GC sweeps.
|
||||
// CountActiveOrdersByAccount returns the number of orders in
|
||||
// pending/ready/processing for the given account.
|
||||
CountActiveOrdersByAccount(ctx context.Context, accountID string) (int, error)
|
||||
// GCExpiredNonces deletes nonces whose expires_at < now() OR
|
||||
// used = true. Returns rows-affected count for telemetry.
|
||||
GCExpiredNonces(ctx context.Context) (int64, error)
|
||||
// GCExpireAuthorizations transitions authzs in `pending` whose
|
||||
// expires_at < now() to `expired`. Returns rows-affected count.
|
||||
GCExpireAuthorizations(ctx context.Context) (int64, error)
|
||||
// GCInvalidateExpiredOrders transitions orders in
|
||||
// pending/ready/processing whose expires_at < now() to `invalid`
|
||||
// with a server-internal error. Returns rows-affected count.
|
||||
GCInvalidateExpiredOrders(ctx context.Context) (int64, error)
|
||||
}
|
||||
|
||||
// CertificateRevoker is the minimum surface ACMEService needs to route
|
||||
@@ -142,6 +156,12 @@ type ACMEService struct {
|
||||
// and RenewalInfo returns the no-policy default window.
|
||||
revoker CertificateRevoker
|
||||
renewalPolicies RenewalPolicyLookup
|
||||
|
||||
// Phase 5 — per-account rate limiter. cmd/server/main.go constructs
|
||||
// an *acme.RateLimiter and wires it via SetRateLimiter. When unset
|
||||
// (tests, legacy bootstrap) the limiter calls short-circuit to
|
||||
// "always allow" — same shape as the validatorPool unset case.
|
||||
rateLimiter *acme.RateLimiter
|
||||
}
|
||||
|
||||
// NewACMEService constructs an ACMEService with the directory + nonce
|
||||
@@ -196,6 +216,16 @@ func (s *ACMEService) SetRevocationDelegate(r CertificateRevoker) { s.revoker =
|
||||
// still returns 200.
|
||||
func (s *ACMEService) SetRenewalPolicyLookup(r RenewalPolicyLookup) { s.renewalPolicies = r }
|
||||
|
||||
// SetRateLimiter wires Phase 5's per-account rate limiter. Optional —
|
||||
// when nil, the per-action rate-limit checks short-circuit to
|
||||
// "always allow" so the legacy code path stays unchanged for bootstrap
|
||||
// + tests that don't care about throttling.
|
||||
func (s *ACMEService) SetRateLimiter(r *acme.RateLimiter) { s.rateLimiter = r }
|
||||
|
||||
// RateLimiter returns the wired limiter so the handler can compute
|
||||
// Retry-After durations on rate-limited responses without re-checking.
|
||||
func (s *ACMEService) RateLimiter() *acme.RateLimiter { return s.rateLimiter }
|
||||
|
||||
// SetValidatorPool wires Phase 3's challenge validator pool.
|
||||
// cmd/server/main.go constructs an *acme.Pool at startup with the
|
||||
// per-type concurrency caps from cfg.ACMEServer. Optional —
|
||||
@@ -334,6 +364,19 @@ var ErrACMEARIDisabled = errors.New("acme: ARI is disabled on this server")
|
||||
// not RFC 9773 §4.1 shape. Handler maps to 400 + malformed.
|
||||
var ErrACMEARIBadCertID = errors.New("acme: ARI cert-id is malformed")
|
||||
|
||||
// Phase 5 sentinels.
|
||||
|
||||
// ErrACMERateLimited is returned when the per-action rate limit fires.
|
||||
// Handler maps to RFC 7807 + RFC 8555 §6.7
|
||||
// `urn:ietf:params:acme:error:rateLimited` with a Retry-After header.
|
||||
var ErrACMERateLimited = errors.New("acme: rate limit exceeded")
|
||||
|
||||
// ErrACMEConcurrentOrdersExceeded is returned by CreateOrder when the
|
||||
// account already has cfg.RateLimitConcurrentOrders orders in
|
||||
// pending/ready/processing. Handler maps to rateLimited (RFC 8555 §6.7
|
||||
// shape; the certctl-side cause is concurrency rather than per-hour).
|
||||
var ErrACMEConcurrentOrdersExceeded = errors.New("acme: concurrent orders limit exceeded")
|
||||
|
||||
// BuildDirectory constructs the per-profile directory document.
|
||||
//
|
||||
// profileID resolution:
|
||||
@@ -463,6 +506,13 @@ type ACMEMetrics struct {
|
||||
RevokeCertFailTotal atomic.Uint64 // rejected revocation (4xx)
|
||||
RenewalInfoTotal atomic.Uint64 // ARI 200
|
||||
RenewalInfoFailTotal atomic.Uint64 // ARI 4xx
|
||||
|
||||
// Phase 5 — GC sweep counts (per-tick rows-affected, summed).
|
||||
GCNoncesReapedTotal atomic.Uint64
|
||||
GCAuthzsExpiredTotal atomic.Uint64
|
||||
GCOrdersInvalidatedTotal atomic.Uint64
|
||||
GCRunsTotal atomic.Uint64
|
||||
GCRunFailuresTotal atomic.Uint64
|
||||
}
|
||||
|
||||
// NewACMEMetrics returns a zeroed counter table. Concurrent callers
|
||||
@@ -508,6 +558,11 @@ func (m *ACMEMetrics) Snapshot() map[string]uint64 {
|
||||
"certctl_acme_revoke_cert_failures_total": m.RevokeCertFailTotal.Load(),
|
||||
"certctl_acme_renewal_info_total": m.RenewalInfoTotal.Load(),
|
||||
"certctl_acme_renewal_info_failures_total": m.RenewalInfoFailTotal.Load(),
|
||||
"certctl_acme_gc_nonces_reaped_total": m.GCNoncesReapedTotal.Load(),
|
||||
"certctl_acme_gc_authzs_expired_total": m.GCAuthzsExpiredTotal.Load(),
|
||||
"certctl_acme_gc_orders_invalidated_total": m.GCOrdersInvalidatedTotal.Load(),
|
||||
"certctl_acme_gc_runs_total": m.GCRunsTotal.Load(),
|
||||
"certctl_acme_gc_run_failures_total": m.GCRunFailuresTotal.Load(),
|
||||
}
|
||||
}
|
||||
|
||||
@@ -783,6 +838,27 @@ func (s *ACMEService) CreateOrder(
|
||||
s.metrics.bump(&s.metrics.NewOrderFailureTotal)
|
||||
return nil, fmt.Errorf("acme: new-order requires SetTransactor + SetAuditService")
|
||||
}
|
||||
// Phase 5 — per-account orders/hour cap. Hits return rateLimited
|
||||
// (RFC 8555 §6.7) before any DB work. Counter is in-memory; restart
|
||||
// wipes (eventual-consistency caps are acceptable).
|
||||
if s.rateLimiter != nil && s.cfg.RateLimitOrdersPerHour > 0 {
|
||||
if !s.rateLimiter.Allow(acme.ActionNewOrder, accountID, s.cfg.RateLimitOrdersPerHour) {
|
||||
s.metrics.bump(&s.metrics.NewOrderFailureTotal)
|
||||
return nil, ErrACMERateLimited
|
||||
}
|
||||
}
|
||||
// Phase 5 — concurrent-orders cap. We count
|
||||
// pending/ready/processing orders for this account; if at-or-over
|
||||
// the cap, reject. This is a DB read (no FOR UPDATE), so two
|
||||
// requests racing under the threshold can both succeed and push
|
||||
// the account one over — accepted as eventual-consistency.
|
||||
if s.cfg.RateLimitConcurrentOrders > 0 {
|
||||
count, cerr := s.repo.CountActiveOrdersByAccount(ctx, accountID)
|
||||
if cerr == nil && count >= s.cfg.RateLimitConcurrentOrders {
|
||||
s.metrics.bump(&s.metrics.NewOrderFailureTotal)
|
||||
return nil, ErrACMEConcurrentOrdersExceeded
|
||||
}
|
||||
}
|
||||
resolvedProfileID, err := s.resolveProfile(ctx, profileID)
|
||||
if err != nil {
|
||||
s.metrics.bump(&s.metrics.NewOrderFailureTotal)
|
||||
@@ -1285,6 +1361,16 @@ func (s *ACMEService) RespondToChallenge(
|
||||
s.metrics.bump(&s.metrics.ChallengeRespondFailTotal)
|
||||
return nil, ErrACMEChallengePoolUnconfigured
|
||||
}
|
||||
// Phase 5 — per-challenge respond rate limit. Defends against retry
|
||||
// storms from a misbehaving client. Keyed by challengeID (not
|
||||
// accountID) so a flood against one challenge doesn't drain the
|
||||
// account's whole budget.
|
||||
if s.rateLimiter != nil && s.cfg.RateLimitChallengeRespondsPerHour > 0 {
|
||||
if !s.rateLimiter.Allow(acme.ActionChallengeRespond, challengeID, s.cfg.RateLimitChallengeRespondsPerHour) {
|
||||
s.metrics.bump(&s.metrics.ChallengeRespondFailTotal)
|
||||
return nil, ErrACMERateLimited
|
||||
}
|
||||
}
|
||||
|
||||
ch, err := s.repo.GetChallengeByID(ctx, challengeID)
|
||||
if err != nil {
|
||||
@@ -1508,6 +1594,14 @@ func (s *ACMEService) RotateAccountKey(
|
||||
s.metrics.bump(&s.metrics.KeyChangeFailTotal)
|
||||
return nil, ErrACMEKeyRolloverInvalid
|
||||
}
|
||||
// Phase 5 — rollovers/hour cap. Defaults to 5/hour: a flood is an
|
||||
// attack signal (key rotation should be rare). Keyed by accountID.
|
||||
if s.rateLimiter != nil && s.cfg.RateLimitKeyChangePerHour > 0 {
|
||||
if !s.rateLimiter.Allow(acme.ActionKeyChange, oldAccount.AccountID, s.cfg.RateLimitKeyChangePerHour) {
|
||||
s.metrics.bump(&s.metrics.KeyChangeFailTotal)
|
||||
return nil, ErrACMERateLimited
|
||||
}
|
||||
}
|
||||
|
||||
newThumbprint, err := acme.JWKThumbprint(newJWK)
|
||||
if err != nil {
|
||||
@@ -1816,3 +1910,56 @@ func mapACMERevocationReason(code int) string {
|
||||
return string(domain.RevocationReasonUnspecified)
|
||||
}
|
||||
}
|
||||
|
||||
// GarbageCollect runs a single ACME GC sweep. Phase 5 — the scheduler
|
||||
// invokes this every cfg.GCInterval. Three independent sweeps:
|
||||
//
|
||||
// 1. Delete used / expired nonces.
|
||||
// 2. Transition expired pending authzs to `expired`.
|
||||
// 3. Transition expired pending/ready/processing orders to `invalid`.
|
||||
//
|
||||
// Each sweep is a single SQL statement (no per-row transactions) so a
|
||||
// large reap is one atomic write per sweep. Per-sweep errors are
|
||||
// logged-and-continued: a failing nonces sweep doesn't block the
|
||||
// authzs sweep. Returns the first error encountered (for caller
|
||||
// telemetry); per-sweep counts are recorded on metrics regardless.
|
||||
//
|
||||
// Idempotent — repeated runs are safe; the second run finds 0 rows.
|
||||
func (s *ACMEService) GarbageCollect(ctx context.Context) error {
|
||||
s.metrics.bump(&s.metrics.GCRunsTotal)
|
||||
var firstErr error
|
||||
|
||||
if n, err := s.repo.GCExpiredNonces(ctx); err != nil {
|
||||
s.metrics.bump(&s.metrics.GCRunFailuresTotal)
|
||||
if firstErr == nil {
|
||||
firstErr = fmt.Errorf("acme gc: nonces: %w", err)
|
||||
}
|
||||
} else if n > 0 {
|
||||
atomicAddUint64(&s.metrics.GCNoncesReapedTotal, uint64(n))
|
||||
}
|
||||
|
||||
if n, err := s.repo.GCExpireAuthorizations(ctx); err != nil {
|
||||
s.metrics.bump(&s.metrics.GCRunFailuresTotal)
|
||||
if firstErr == nil {
|
||||
firstErr = fmt.Errorf("acme gc: authzs: %w", err)
|
||||
}
|
||||
} else if n > 0 {
|
||||
atomicAddUint64(&s.metrics.GCAuthzsExpiredTotal, uint64(n))
|
||||
}
|
||||
|
||||
if n, err := s.repo.GCInvalidateExpiredOrders(ctx); err != nil {
|
||||
s.metrics.bump(&s.metrics.GCRunFailuresTotal)
|
||||
if firstErr == nil {
|
||||
firstErr = fmt.Errorf("acme gc: orders: %w", err)
|
||||
}
|
||||
} else if n > 0 {
|
||||
atomicAddUint64(&s.metrics.GCOrdersInvalidatedTotal, uint64(n))
|
||||
}
|
||||
|
||||
return firstErr
|
||||
}
|
||||
|
||||
// atomicAddUint64 adds delta to the counter. The metrics struct exposes
|
||||
// only `bump` (add 1) by default; this helper covers the
|
||||
// rows-affected-N case the GC needs.
|
||||
func atomicAddUint64(c *atomic.Uint64, delta uint64) { c.Add(delta) }
|
||||
|
||||
@@ -164,6 +164,23 @@ func (f *fakeACMERepo) UpdateAccountJWKWithTx(ctx context.Context, q repository.
|
||||
func (f *fakeACMERepo) AccountOwnsCertificate(ctx context.Context, accountID, certificateID string) (bool, error) {
|
||||
return false, nil
|
||||
}
|
||||
func (f *fakeACMERepo) CountActiveOrdersByAccount(ctx context.Context, accountID string) (int, error) {
|
||||
return 0, nil
|
||||
}
|
||||
func (f *fakeACMERepo) GCExpiredNonces(ctx context.Context) (int64, error) {
|
||||
n := int64(0)
|
||||
for nonce, exp := range f.issued {
|
||||
if time.Now().After(exp) {
|
||||
delete(f.issued, nonce)
|
||||
n++
|
||||
}
|
||||
}
|
||||
return n, nil
|
||||
}
|
||||
func (f *fakeACMERepo) GCExpireAuthorizations(ctx context.Context) (int64, error) { return 0, nil }
|
||||
func (f *fakeACMERepo) GCInvalidateExpiredOrders(ctx context.Context) (int64, error) {
|
||||
return 0, nil
|
||||
}
|
||||
|
||||
// fakeTransactor is the repository.Transactor stand-in: runs fn
|
||||
// against the supplied querier (we just pass nil — fakes ignore it).
|
||||
|
||||
Reference in New Issue
Block a user