mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 15:51:30 +00:00
acme-server: cert-manager integration test + production hardening (Phase 5/7)
Closes the production-readiness loop on the ACME surface. After this
commit, certctl ships per-account rate limits + a GC sweeper for
expired ACME state + a kind-driven cert-manager 1.15 integration test
+ a lego-driven RFC conformance harness + a k6 loadtest scenario for
the unauthenticated ACME path.
Architecture:
- Rate limits live in-memory + per-replica. Restart wipes the
counters; orders/hour caps are eventual-consistency anyway. A
3-replica certctl-server fleet behind an LB effectively has 3x
the configured throughput per account; persistent rate limiting
is a follow-up if production telemetry shows abuse patterns we
can't catch in a single restart cycle. Per-key + per-action
isolation: ActionNewOrder/acc-1, ActionKeyChange/acc-1, and
ActionChallengeRespond/<challenge-id> are independent buckets.
- GC loop follows the existing scheduler-loop pattern (atomic.Bool
+ sync.WaitGroup; see crlGenerationLoop for shape). Three
independent SQL sweeps per tick (DELETE expired nonces; UPDATE
pending authzs whose expires_at < now() to expired; UPDATE
pending/ready/processing orders whose expires_at < now() to
invalid). Each sweep is a single statement; failures are logged-
and-continued so a failing nonces sweep doesn't block authzs.
Per-sweep 1m timeout bounds a stuck Postgres.
- cert-manager integration test is gated on KIND_AVAILABLE so CI
skips it cleanly (kind is too heavy for per-PR). Operators run
locally via 'make acme-cert-manager-test'; the harness brings up
a fresh cluster each run + tears it down on Cleanup.
- lego conformance harness drives a real ACME client through
register → run → cert-PEM-landed against a hermetic certctl
stack. Catches RFC-shape regressions third-party clients would
hit before they ship.
- k6 ACME-flow scenario hammers the unauthenticated surface
(directory + new-nonce + ARI synthetic-id) at 100 VUs × 5m. JWS-
signed flows are out of scope for k6 (no JWS support); they're
covered by the lego harness above.
What ships:
- internal/api/acme/ratelimit.go (+ ratelimit_test.go: 7 cases —
disable-when-perHour-zero, capacity, per-key isolation, per-
action isolation, refill-over-time, RetryAfter, concurrent-access
with -race + 200 goroutines × 200 calls).
- internal/repository/postgres/acme.go: 4 new methods —
CountActiveOrdersByAccount + GCExpiredNonces + GCExpireAuthorizations
+ GCInvalidateExpiredOrders. Each a single SQL statement.
- internal/service/acme.go: SetRateLimiter + GarbageCollect +
rate-limit gates at 3 entry points (CreateOrder + RotateAccountKey
+ RespondToChallenge) + concurrent-orders gate at CreateOrder.
2 new sentinels (ErrACMERateLimited, ErrACMEConcurrentOrdersExceeded);
5 new GC metrics (gc_runs / gc_run_failures / gc_nonces_reaped /
gc_authzs_expired / gc_orders_invalidated).
- internal/scheduler/scheduler.go: ACMEGarbageCollector interface +
acmeGCRunning atomic.Bool + acmeGCInterval + 2 setters (SetACME-
GarbageCollector + SetACMEGCInterval) + acmeGCLoop following the
crlGenerationLoop shape.
- internal/api/handler/acme.go: writeServiceError gains rateLimited
(429 + RFC 8555 §6.7) + concurrent-orders-exceeded mappings.
- internal/config/config.go: 5 new env vars
(CERTCTL_ACME_SERVER_RATE_LIMIT_ORDERS_PER_HOUR=100,
CERTCTL_ACME_SERVER_RATE_LIMIT_CONCURRENT_ORDERS=5,
CERTCTL_ACME_SERVER_RATE_LIMIT_KEY_CHANGE_PER_HOUR=5,
CERTCTL_ACME_SERVER_RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR=60,
CERTCTL_ACME_SERVER_GC_INTERVAL=1m).
- cmd/server/main.go: NewRateLimiter() + SetRateLimiter() at
startup; conditional SetACMEGarbageCollector(acmeService) +
SetACMEGCInterval(cfg.ACMEServer.GCInterval) when Enabled+
GCInterval > 0.
- deploy/test/acme-integration/: kind-config.yaml + cert-manager-
install.sh + clusterissuer-trust-authenticated.yaml +
clusterissuer-challenge.yaml + certificate-test.yaml + conformance-
lego.sh + certmanager_test.go (//go:build integration + KIND_AVAILABLE
gate).
- deploy/test/loadtest/k6/acme_flow.js + README ACME-flows section.
- Makefile: 2 new PHONY targets (acme-cert-manager-test +
acme-rfc-conformance-test).
- docs/acme-server.md: status flipped to Phase 5; Configuration
table grows 5 rows; new 'Phase 5 — operational guidance' section
explaining rate-limit math + GC sweeper semantics + cert-manager
integration + lego conformance + k6 baseline.
Tests:
- 'go vet ./...' clean across the repo.
- 'go test -short -count=1 ./internal/...' green across every
affected package (service / acme / handler / scheduler / repo /
config).
- 'go vet -tags=integration ./deploy/test/acme-integration/' clean
(the integration test compiles cleanly with the build tag).
- The kind/cert-manager harness is gated behind KIND_AVAILABLE so
CI skips by default; operators run locally via 'make acme-cert-
manager-test'.
Engineering history: cowork/WORKSPACE-CHANGELOG.md 'ACME-Server-5'.
This commit is contained in:
@@ -0,0 +1,166 @@
|
||||
// Copyright (c) certctl
|
||||
// SPDX-License-Identifier: BSL-1.1
|
||||
|
||||
package acme
|
||||
|
||||
import (
|
||||
"errors"
|
||||
"sync"
|
||||
"time"
|
||||
)
|
||||
|
||||
// Phase 5 — per-account rolling-hour rate limiter for ACME operations.
|
||||
//
|
||||
// Architecture:
|
||||
// - In-memory token-bucket per (key, action). Restart wipes the
|
||||
// buckets; orders/hour caps are eventual-consistency so this is
|
||||
// acceptable. Persistent rate limiting is a follow-up if production
|
||||
// telemetry shows abuse patterns we can't catch in a single restart
|
||||
// cycle (master prompt criterion #11 explicitly accepts this).
|
||||
// - Tokens-per-hour math: bucket capacity = perHour, refill rate =
|
||||
// perHour / 3600 tokens/sec. A fresh bucket starts full; an over-
|
||||
// limit caller drains it then has to wait for replenishment.
|
||||
// - Key shape is action-specific: orders use accountID; key-rollover
|
||||
// uses accountID; challenge-respond uses challengeID (so a flood
|
||||
// against one challenge doesn't burn the whole account's budget).
|
||||
//
|
||||
// Concurrency: the outer map is RWMutex-guarded for create-on-demand;
|
||||
// per-bucket allow() takes a tiny per-bucket Mutex. Mirrors the
|
||||
// existing internal/api/middleware/middleware.go::keyedRateLimiter
|
||||
// pattern (different scope, same shape).
|
||||
|
||||
// RateLimiter is the per-action token-bucket pool. Construct with
|
||||
// NewRateLimiter(); pass a single instance into ACMEService via
|
||||
// SetRateLimiter so all entry points share the same buckets.
|
||||
type RateLimiter struct {
|
||||
mu sync.RWMutex
|
||||
buckets map[string]*rlBucket // keyed by "<action>|<keyID>"
|
||||
clock func() time.Time // injectable for tests
|
||||
}
|
||||
|
||||
// NewRateLimiter returns an empty RateLimiter. Buckets are created on
|
||||
// first reference, so a fresh limiter does no work until traffic
|
||||
// arrives.
|
||||
func NewRateLimiter() *RateLimiter {
|
||||
return &RateLimiter{
|
||||
buckets: make(map[string]*rlBucket),
|
||||
clock: time.Now,
|
||||
}
|
||||
}
|
||||
|
||||
// SetClock replaces the clock for tests. Production callers leave it
|
||||
// pointing at time.Now (the constructor default).
|
||||
func (r *RateLimiter) SetClock(now func() time.Time) {
|
||||
if now != nil {
|
||||
r.clock = now
|
||||
}
|
||||
}
|
||||
|
||||
// Allow returns true when the (action, keyID) bucket has at least one
|
||||
// token available — and consumes that token. perHour=0 disables the
|
||||
// limit (always true). Negative perHour is treated as 0.
|
||||
//
|
||||
// On hit (first call → first token consumed → returns true). Once
|
||||
// drained, further calls within the same hour return false until
|
||||
// elapsed-time refills the bucket.
|
||||
func (r *RateLimiter) Allow(action, keyID string, perHour int) bool {
|
||||
if perHour <= 0 {
|
||||
return true
|
||||
}
|
||||
bucketKey := action + "|" + keyID
|
||||
r.mu.RLock()
|
||||
b, ok := r.buckets[bucketKey]
|
||||
r.mu.RUnlock()
|
||||
if !ok {
|
||||
r.mu.Lock()
|
||||
b, ok = r.buckets[bucketKey]
|
||||
if !ok {
|
||||
b = &rlBucket{
|
||||
capacity: float64(perHour),
|
||||
refillRate: float64(perHour) / 3600.0, // tokens/sec
|
||||
tokens: float64(perHour),
|
||||
lastRefill: r.clock(),
|
||||
}
|
||||
r.buckets[bucketKey] = b
|
||||
}
|
||||
r.mu.Unlock()
|
||||
}
|
||||
return b.allow(r.clock)
|
||||
}
|
||||
|
||||
// RetryAfter returns the duration the caller should wait before the
|
||||
// (action, keyID) bucket has at least one token again. Returns 0 when
|
||||
// at least one token is currently available. Used by the handler to
|
||||
// emit a Retry-After header on rateLimited responses.
|
||||
func (r *RateLimiter) RetryAfter(action, keyID string, perHour int) time.Duration {
|
||||
if perHour <= 0 {
|
||||
return 0
|
||||
}
|
||||
bucketKey := action + "|" + keyID
|
||||
r.mu.RLock()
|
||||
b, ok := r.buckets[bucketKey]
|
||||
r.mu.RUnlock()
|
||||
if !ok {
|
||||
return 0
|
||||
}
|
||||
b.mu.Lock()
|
||||
defer b.mu.Unlock()
|
||||
if b.tokens >= 1 {
|
||||
return 0
|
||||
}
|
||||
missing := 1 - b.tokens
|
||||
if b.refillRate <= 0 {
|
||||
// Shouldn't happen (Allow rejects perHour<=0 before bucket
|
||||
// creation), but a divide-by-zero here would panic.
|
||||
return time.Hour
|
||||
}
|
||||
secs := missing / b.refillRate
|
||||
return time.Duration(secs * float64(time.Second))
|
||||
}
|
||||
|
||||
// rlBucket is the per-(action, keyID) token bucket. Mirrors the shape
|
||||
// of internal/api/middleware/middleware.go::tokenBucket but with a
|
||||
// per-hour-shaped refill instead of per-second.
|
||||
type rlBucket struct {
|
||||
mu sync.Mutex
|
||||
capacity float64
|
||||
refillRate float64 // tokens per second
|
||||
tokens float64
|
||||
lastRefill time.Time
|
||||
}
|
||||
|
||||
func (b *rlBucket) allow(clock func() time.Time) bool {
|
||||
b.mu.Lock()
|
||||
defer b.mu.Unlock()
|
||||
|
||||
now := clock()
|
||||
// Monotonic-clock-safe via t.Sub(t) per Go time-package contract.
|
||||
elapsed := now.Sub(b.lastRefill).Seconds()
|
||||
if elapsed > 0 {
|
||||
b.tokens += elapsed * b.refillRate
|
||||
if b.tokens > b.capacity {
|
||||
b.tokens = b.capacity
|
||||
}
|
||||
b.lastRefill = now
|
||||
}
|
||||
if b.tokens < 1 {
|
||||
return false
|
||||
}
|
||||
b.tokens--
|
||||
return true
|
||||
}
|
||||
|
||||
// Action constants — keep one source of truth for the bucket-key
|
||||
// `<action>|...` prefix. Using untyped consts (not iota) so they
|
||||
// survive cross-process coordination if a follow-up adds shared-state
|
||||
// rate-limiting.
|
||||
const (
|
||||
ActionNewOrder = "new_order"
|
||||
ActionKeyChange = "key_change"
|
||||
ActionChallengeRespond = "challenge_respond"
|
||||
)
|
||||
|
||||
// ErrRateLimited is the sentinel service-layer entry points return on
|
||||
// a hit. Handler maps to RFC 7807 + RFC 8555 §6.7
|
||||
// `urn:ietf:params:acme:error:rateLimited` with Retry-After.
|
||||
var ErrRateLimited = errors.New("acme: rate limit exceeded")
|
||||
@@ -0,0 +1,159 @@
|
||||
// Copyright (c) certctl
|
||||
// SPDX-License-Identifier: BSL-1.1
|
||||
|
||||
package acme
|
||||
|
||||
import (
|
||||
"sync"
|
||||
"testing"
|
||||
"time"
|
||||
)
|
||||
|
||||
// Phase 5 — RateLimiter unit tests.
|
||||
|
||||
func TestRateLimiter_DisabledWhenPerHourZero(t *testing.T) {
|
||||
r := NewRateLimiter()
|
||||
for i := 0; i < 10000; i++ {
|
||||
if !r.Allow(ActionNewOrder, "acc-1", 0) {
|
||||
t.Fatalf("Allow returned false on call %d with perHour=0", i)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func TestRateLimiter_DisabledWhenPerHourNegative(t *testing.T) {
|
||||
r := NewRateLimiter()
|
||||
if !r.Allow(ActionNewOrder, "acc-1", -5) {
|
||||
t.Errorf("Allow returned false with perHour=-5; expected always-allow")
|
||||
}
|
||||
}
|
||||
|
||||
func TestRateLimiter_BucketCapacity(t *testing.T) {
|
||||
// Frozen clock: a fresh bucket has perHour tokens. Drain exactly
|
||||
// that many; the next call must return false.
|
||||
now := time.Date(2026, 5, 3, 12, 0, 0, 0, time.UTC)
|
||||
r := NewRateLimiter()
|
||||
r.SetClock(func() time.Time { return now })
|
||||
|
||||
for i := 0; i < 100; i++ {
|
||||
if !r.Allow(ActionNewOrder, "acc-1", 100) {
|
||||
t.Fatalf("Allow returned false on call %d (within capacity)", i)
|
||||
}
|
||||
}
|
||||
if r.Allow(ActionNewOrder, "acc-1", 100) {
|
||||
t.Errorf("Allow returned true on the 101st call; expected limit hit")
|
||||
}
|
||||
}
|
||||
|
||||
func TestRateLimiter_PerKeyIsolation(t *testing.T) {
|
||||
// Frozen clock — drain acc-1 to zero, then acc-2 should still have
|
||||
// a full bucket (separate key).
|
||||
now := time.Date(2026, 5, 3, 12, 0, 0, 0, time.UTC)
|
||||
r := NewRateLimiter()
|
||||
r.SetClock(func() time.Time { return now })
|
||||
|
||||
for i := 0; i < 100; i++ {
|
||||
_ = r.Allow(ActionNewOrder, "acc-1", 100)
|
||||
}
|
||||
if r.Allow(ActionNewOrder, "acc-1", 100) {
|
||||
t.Errorf("acc-1 should be rate-limited")
|
||||
}
|
||||
if !r.Allow(ActionNewOrder, "acc-2", 100) {
|
||||
t.Errorf("acc-2 should be unaffected by acc-1's bucket; expected allow")
|
||||
}
|
||||
}
|
||||
|
||||
func TestRateLimiter_PerActionIsolation(t *testing.T) {
|
||||
// Same key but different actions get different buckets.
|
||||
now := time.Date(2026, 5, 3, 12, 0, 0, 0, time.UTC)
|
||||
r := NewRateLimiter()
|
||||
r.SetClock(func() time.Time { return now })
|
||||
|
||||
for i := 0; i < 5; i++ {
|
||||
_ = r.Allow(ActionKeyChange, "acc-1", 5)
|
||||
}
|
||||
if r.Allow(ActionKeyChange, "acc-1", 5) {
|
||||
t.Errorf("ActionKeyChange should be rate-limited")
|
||||
}
|
||||
// ActionNewOrder for the same key has its own (empty) bucket.
|
||||
if !r.Allow(ActionNewOrder, "acc-1", 100) {
|
||||
t.Errorf("ActionNewOrder for same key should be allowed (different bucket)")
|
||||
}
|
||||
}
|
||||
|
||||
func TestRateLimiter_RefillOverTime(t *testing.T) {
|
||||
// Drain bucket; advance the clock; expect tokens replenished.
|
||||
current := time.Date(2026, 5, 3, 12, 0, 0, 0, time.UTC)
|
||||
r := NewRateLimiter()
|
||||
r.SetClock(func() time.Time { return current })
|
||||
|
||||
for i := 0; i < 100; i++ {
|
||||
_ = r.Allow(ActionNewOrder, "acc-1", 100)
|
||||
}
|
||||
if r.Allow(ActionNewOrder, "acc-1", 100) {
|
||||
t.Fatalf("expected limit hit after draining bucket")
|
||||
}
|
||||
// Advance by 36 seconds: at 100/hour = 100/3600 tokens/sec ≈
|
||||
// 0.0278/sec. 36 * 0.0278 = 1.00 tokens — exactly enough for 1
|
||||
// more call.
|
||||
current = current.Add(36 * time.Second)
|
||||
if !r.Allow(ActionNewOrder, "acc-1", 100) {
|
||||
t.Errorf("Allow returned false after 36s elapsed; expected ≥1 token replenished")
|
||||
}
|
||||
}
|
||||
|
||||
func TestRateLimiter_RetryAfter(t *testing.T) {
|
||||
now := time.Date(2026, 5, 3, 12, 0, 0, 0, time.UTC)
|
||||
r := NewRateLimiter()
|
||||
r.SetClock(func() time.Time { return now })
|
||||
|
||||
// Drain to zero.
|
||||
for i := 0; i < 100; i++ {
|
||||
_ = r.Allow(ActionNewOrder, "acc-1", 100)
|
||||
}
|
||||
d := r.RetryAfter(ActionNewOrder, "acc-1", 100)
|
||||
// 1 token at 100/hour = 36 seconds.
|
||||
if d < 35*time.Second || d > 37*time.Second {
|
||||
t.Errorf("RetryAfter = %v, expected ~36s", d)
|
||||
}
|
||||
// Allow above capacity — RetryAfter returns 0 on a fresh bucket.
|
||||
if zero := r.RetryAfter(ActionNewOrder, "acc-fresh", 100); zero != 0 {
|
||||
t.Errorf("RetryAfter for fresh bucket = %v, expected 0", zero)
|
||||
}
|
||||
}
|
||||
|
||||
func TestRateLimiter_ConcurrentAccess(t *testing.T) {
|
||||
// Hammer 200 goroutines × 200 calls each = 40000 calls against a
|
||||
// 1000-token bucket; assert no panic, no data race (run with -race),
|
||||
// and that no more than 1000 calls succeeded.
|
||||
now := time.Date(2026, 5, 3, 12, 0, 0, 0, time.UTC)
|
||||
r := NewRateLimiter()
|
||||
r.SetClock(func() time.Time { return now })
|
||||
|
||||
var (
|
||||
wg sync.WaitGroup
|
||||
success int64
|
||||
mu sync.Mutex
|
||||
)
|
||||
for g := 0; g < 200; g++ {
|
||||
wg.Add(1)
|
||||
go func() {
|
||||
defer wg.Done()
|
||||
local := int64(0)
|
||||
for i := 0; i < 200; i++ {
|
||||
if r.Allow(ActionNewOrder, "shared-acc", 1000) {
|
||||
local++
|
||||
}
|
||||
}
|
||||
mu.Lock()
|
||||
success += local
|
||||
mu.Unlock()
|
||||
}()
|
||||
}
|
||||
wg.Wait()
|
||||
if success > 1000 {
|
||||
t.Errorf("got %d successes, want ≤ 1000 (bucket capacity)", success)
|
||||
}
|
||||
if success < 1000 {
|
||||
t.Errorf("got %d successes, want exactly 1000 (frozen clock, no refill)", success)
|
||||
}
|
||||
}
|
||||
@@ -274,6 +274,23 @@ func writeServiceError(w http.ResponseWriter, err error) {
|
||||
})
|
||||
case errors.Is(err, service.ErrACMEARIBadCertID):
|
||||
acme.WriteProblem(w, acme.Malformed("ARI cert-id is malformed"))
|
||||
case errors.Is(err, service.ErrACMERateLimited):
|
||||
// RFC 8555 §6.7 + RFC 7807. The handler doesn't have the
|
||||
// (action, key) tuple here so we can't emit a precise
|
||||
// Retry-After; the entry-point handlers (NewOrder etc.) emit
|
||||
// their own Retry-After header before delegating to the
|
||||
// service, leaving this catchall for completeness.
|
||||
acme.WriteProblem(w, acme.Problem{
|
||||
Type: "urn:ietf:params:acme:error:rateLimited",
|
||||
Detail: "request rate limit exceeded; retry later",
|
||||
Status: http.StatusTooManyRequests,
|
||||
})
|
||||
case errors.Is(err, service.ErrACMEConcurrentOrdersExceeded):
|
||||
acme.WriteProblem(w, acme.Problem{
|
||||
Type: "urn:ietf:params:acme:error:rateLimited",
|
||||
Detail: "too many concurrent orders for this account; finish or cancel pending orders before submitting more",
|
||||
Status: http.StatusTooManyRequests,
|
||||
})
|
||||
default:
|
||||
// Avoid leaking internal error text per master-prompt
|
||||
// criterion #10 (operator-actionable errors with no info
|
||||
|
||||
+53
-12
@@ -744,6 +744,42 @@ type ACMEServerConfig struct {
|
||||
// certs. Setting: CERTCTL_ACME_SERVER_ARI_POLL_INTERVAL.
|
||||
ARIPollInterval time.Duration
|
||||
|
||||
// RateLimitOrdersPerHour caps new-order requests per ACME account per
|
||||
// rolling hour. 0 disables (no limit). Default: 100. Hits return RFC
|
||||
// 7807 + RFC 8555 §6.7 `urn:ietf:params:acme:error:rateLimited` with
|
||||
// a Retry-After header. In-memory token-bucket — restart wipes the
|
||||
// counter, which is acceptable for orders/hour caps (eventual-
|
||||
// consistency anyway). Setting:
|
||||
// CERTCTL_ACME_SERVER_RATE_LIMIT_ORDERS_PER_HOUR.
|
||||
RateLimitOrdersPerHour int
|
||||
|
||||
// RateLimitConcurrentOrders caps the number of orders an ACME account
|
||||
// can have in pending/ready/processing state simultaneously. 0
|
||||
// disables. Default: 5. Same Problem shape as the per-hour limit.
|
||||
// Setting: CERTCTL_ACME_SERVER_RATE_LIMIT_CONCURRENT_ORDERS.
|
||||
RateLimitConcurrentOrders int
|
||||
|
||||
// RateLimitKeyChangePerHour caps account-key rollovers per account
|
||||
// per rolling hour. 0 disables. Default: 5 (rollovers should be rare;
|
||||
// a flood is an attack signal). Setting:
|
||||
// CERTCTL_ACME_SERVER_RATE_LIMIT_KEY_CHANGE_PER_HOUR.
|
||||
RateLimitKeyChangePerHour int
|
||||
|
||||
// RateLimitChallengeRespondsPerHour caps challenge-respond requests
|
||||
// per challenge per rolling hour. 0 disables. Default: 60 (defends
|
||||
// against retry storms from a misbehaving client). Setting:
|
||||
// CERTCTL_ACME_SERVER_RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR.
|
||||
RateLimitChallengeRespondsPerHour int
|
||||
|
||||
// GCInterval is the tick interval for the ACME GC scheduler loop.
|
||||
// On each tick the loop sweeps expired nonces, transitions expired
|
||||
// pending authzs to `expired`, transitions expired
|
||||
// pending/ready/processing orders to `invalid`, and reaps Phase-2
|
||||
// atomicity-window orphans (orders without a linked cert when one
|
||||
// should exist). 0 disables the loop entirely. Default: 1m. Setting:
|
||||
// CERTCTL_ACME_SERVER_GC_INTERVAL.
|
||||
GCInterval time.Duration
|
||||
|
||||
// DirectoryMeta is the optional metadata advertised in the directory
|
||||
// document per RFC 8555 §7.1.1.
|
||||
DirectoryMeta ACMEServerDirectoryMeta
|
||||
@@ -1779,18 +1815,23 @@ func Load() (*Config, error) {
|
||||
// NonceTTL + DirectoryMeta. Order/Authz TTLs + concurrency
|
||||
// caps + DNS01 resolver are reserved (Phases 2/3 read).
|
||||
ACMEServer: ACMEServerConfig{
|
||||
Enabled: getEnvBool("CERTCTL_ACME_SERVER_ENABLED", false),
|
||||
DefaultAuthMode: getEnv("CERTCTL_ACME_SERVER_DEFAULT_AUTH_MODE", "trust_authenticated"),
|
||||
DefaultProfileID: getEnv("CERTCTL_ACME_SERVER_DEFAULT_PROFILE_ID", ""),
|
||||
NonceTTL: getEnvDuration("CERTCTL_ACME_SERVER_NONCE_TTL", 5*time.Minute),
|
||||
OrderTTL: getEnvDuration("CERTCTL_ACME_SERVER_ORDER_TTL", 24*time.Hour),
|
||||
AuthzTTL: getEnvDuration("CERTCTL_ACME_SERVER_AUTHZ_TTL", 24*time.Hour),
|
||||
HTTP01ConcurrencyMax: getEnvInt("CERTCTL_ACME_SERVER_HTTP01_CONCURRENCY", 10),
|
||||
DNS01Resolver: getEnv("CERTCTL_ACME_SERVER_DNS01_RESOLVER", "8.8.8.8:53"),
|
||||
DNS01ConcurrencyMax: getEnvInt("CERTCTL_ACME_SERVER_DNS01_CONCURRENCY", 10),
|
||||
TLSALPN01ConcurrencyMax: getEnvInt("CERTCTL_ACME_SERVER_TLSALPN01_CONCURRENCY", 10),
|
||||
ARIEnabled: getEnvBool("CERTCTL_ACME_SERVER_ARI_ENABLED", true),
|
||||
ARIPollInterval: getEnvDuration("CERTCTL_ACME_SERVER_ARI_POLL_INTERVAL", 6*time.Hour),
|
||||
Enabled: getEnvBool("CERTCTL_ACME_SERVER_ENABLED", false),
|
||||
DefaultAuthMode: getEnv("CERTCTL_ACME_SERVER_DEFAULT_AUTH_MODE", "trust_authenticated"),
|
||||
DefaultProfileID: getEnv("CERTCTL_ACME_SERVER_DEFAULT_PROFILE_ID", ""),
|
||||
NonceTTL: getEnvDuration("CERTCTL_ACME_SERVER_NONCE_TTL", 5*time.Minute),
|
||||
OrderTTL: getEnvDuration("CERTCTL_ACME_SERVER_ORDER_TTL", 24*time.Hour),
|
||||
AuthzTTL: getEnvDuration("CERTCTL_ACME_SERVER_AUTHZ_TTL", 24*time.Hour),
|
||||
HTTP01ConcurrencyMax: getEnvInt("CERTCTL_ACME_SERVER_HTTP01_CONCURRENCY", 10),
|
||||
DNS01Resolver: getEnv("CERTCTL_ACME_SERVER_DNS01_RESOLVER", "8.8.8.8:53"),
|
||||
DNS01ConcurrencyMax: getEnvInt("CERTCTL_ACME_SERVER_DNS01_CONCURRENCY", 10),
|
||||
TLSALPN01ConcurrencyMax: getEnvInt("CERTCTL_ACME_SERVER_TLSALPN01_CONCURRENCY", 10),
|
||||
ARIEnabled: getEnvBool("CERTCTL_ACME_SERVER_ARI_ENABLED", true),
|
||||
ARIPollInterval: getEnvDuration("CERTCTL_ACME_SERVER_ARI_POLL_INTERVAL", 6*time.Hour),
|
||||
RateLimitOrdersPerHour: getEnvInt("CERTCTL_ACME_SERVER_RATE_LIMIT_ORDERS_PER_HOUR", 100),
|
||||
RateLimitConcurrentOrders: getEnvInt("CERTCTL_ACME_SERVER_RATE_LIMIT_CONCURRENT_ORDERS", 5),
|
||||
RateLimitKeyChangePerHour: getEnvInt("CERTCTL_ACME_SERVER_RATE_LIMIT_KEY_CHANGE_PER_HOUR", 5),
|
||||
RateLimitChallengeRespondsPerHour: getEnvInt("CERTCTL_ACME_SERVER_RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR", 60),
|
||||
GCInterval: getEnvDuration("CERTCTL_ACME_SERVER_GC_INTERVAL", time.Minute),
|
||||
DirectoryMeta: ACMEServerDirectoryMeta{
|
||||
TermsOfService: getEnv("CERTCTL_ACME_SERVER_TOS_URL", ""),
|
||||
Website: getEnv("CERTCTL_ACME_SERVER_WEBSITE", ""),
|
||||
|
||||
@@ -751,6 +751,85 @@ func (r *ACMERepository) AccountOwnsCertificate(ctx context.Context, accountID,
|
||||
return count > 0, nil
|
||||
}
|
||||
|
||||
// --- Phase 5 — concurrent-orders count + GC sweeps ---------------------
|
||||
|
||||
// CountActiveOrdersByAccount returns the number of acme_orders rows
|
||||
// with the given account_id where status is in
|
||||
// {pending, ready, processing}. Used by the per-account
|
||||
// concurrent-orders rate limit.
|
||||
func (r *ACMERepository) CountActiveOrdersByAccount(ctx context.Context, accountID string) (int, error) {
|
||||
var count int
|
||||
err := r.db.QueryRowContext(ctx, `
|
||||
SELECT COUNT(1)
|
||||
FROM acme_orders
|
||||
WHERE account_id = $1
|
||||
AND status IN ('pending', 'ready', 'processing')
|
||||
`, accountID).Scan(&count)
|
||||
if err != nil {
|
||||
return 0, fmt.Errorf("acme: count active orders: %w", err)
|
||||
}
|
||||
return count, nil
|
||||
}
|
||||
|
||||
// GCExpiredNonces deletes nonce rows that have been used or have
|
||||
// passed their expires_at. Returns rows-affected count for telemetry.
|
||||
// Phase 5 — called every GCInterval from the scheduler.
|
||||
func (r *ACMERepository) GCExpiredNonces(ctx context.Context) (int64, error) {
|
||||
res, err := r.db.ExecContext(ctx, `
|
||||
DELETE FROM acme_nonces
|
||||
WHERE used = TRUE OR expires_at < NOW()
|
||||
`)
|
||||
if err != nil {
|
||||
return 0, fmt.Errorf("acme: gc expired nonces: %w", err)
|
||||
}
|
||||
n, err := res.RowsAffected()
|
||||
if err != nil {
|
||||
return 0, fmt.Errorf("acme: gc expired nonces rows affected: %w", err)
|
||||
}
|
||||
return n, nil
|
||||
}
|
||||
|
||||
// GCExpireAuthorizations transitions authzs in `pending` whose
|
||||
// expires_at < NOW() to `expired`. Authzs in valid/invalid are left
|
||||
// alone (they're already terminal). Returns rows-affected count.
|
||||
func (r *ACMERepository) GCExpireAuthorizations(ctx context.Context) (int64, error) {
|
||||
res, err := r.db.ExecContext(ctx, `
|
||||
UPDATE acme_authorizations
|
||||
SET status = 'expired', updated_at = NOW()
|
||||
WHERE status = 'pending' AND expires_at < NOW()
|
||||
`)
|
||||
if err != nil {
|
||||
return 0, fmt.Errorf("acme: gc expire authorizations: %w", err)
|
||||
}
|
||||
n, err := res.RowsAffected()
|
||||
if err != nil {
|
||||
return 0, fmt.Errorf("acme: gc expire authorizations rows affected: %w", err)
|
||||
}
|
||||
return n, nil
|
||||
}
|
||||
|
||||
// GCInvalidateExpiredOrders transitions orders in
|
||||
// pending/ready/processing whose expires_at < NOW() to `invalid` with
|
||||
// a server-internal error. Orders in valid/invalid are terminal and
|
||||
// untouched.
|
||||
func (r *ACMERepository) GCInvalidateExpiredOrders(ctx context.Context) (int64, error) {
|
||||
const errBlob = `{"type":"urn:ietf:params:acme:error:serverInternal","detail":"order expired before issuance","status":500}`
|
||||
res, err := r.db.ExecContext(ctx, `
|
||||
UPDATE acme_orders
|
||||
SET status = 'invalid', error = $1::jsonb, updated_at = NOW()
|
||||
WHERE status IN ('pending', 'ready', 'processing')
|
||||
AND expires_at < NOW()
|
||||
`, errBlob)
|
||||
if err != nil {
|
||||
return 0, fmt.Errorf("acme: gc invalidate expired orders: %w", err)
|
||||
}
|
||||
n, err := res.RowsAffected()
|
||||
if err != nil {
|
||||
return 0, fmt.Errorf("acme: gc invalidate expired orders rows affected: %w", err)
|
||||
}
|
||||
return n, nil
|
||||
}
|
||||
|
||||
// scanACMEAccount is the shared shape for the SELECT-by-X account
|
||||
// queries above. Returns sql.ErrNoRows-wrapped repository.ErrNotFound
|
||||
// on miss; any other scan failure surfaces verbatim.
|
||||
|
||||
@@ -77,6 +77,13 @@ type CRLCacheServicer interface {
|
||||
RegenerateAll(ctx context.Context)
|
||||
}
|
||||
|
||||
// ACMEGarbageCollector is the interface the scheduler's acmeGCLoop
|
||||
// invokes once per tick. The concrete implementation is *service.ACMEService.
|
||||
// Phase 5 — sweeps expired nonces / authzs / orders.
|
||||
type ACMEGarbageCollector interface {
|
||||
GarbageCollect(ctx context.Context) error
|
||||
}
|
||||
|
||||
// JobReaperService defines the interface for job timeout reaping used by the scheduler.
|
||||
type JobReaperService interface {
|
||||
ReapTimedOutJobs(ctx context.Context, csrTTL, approvalTTL time.Duration) error
|
||||
@@ -101,6 +108,7 @@ type Scheduler struct {
|
||||
healthCheckService HealthCheckServicer
|
||||
cloudDiscoveryService CloudDiscoveryServicer
|
||||
crlCacheService CRLCacheServicer
|
||||
acmeGC ACMEGarbageCollector
|
||||
jobReaper JobReaperService
|
||||
logger *slog.Logger
|
||||
|
||||
@@ -118,6 +126,7 @@ type Scheduler struct {
|
||||
cloudDiscoveryInterval time.Duration
|
||||
crlGenerationInterval time.Duration
|
||||
jobTimeoutInterval time.Duration
|
||||
acmeGCInterval time.Duration
|
||||
// agentOfflineJobTTL: per-tick threshold for reaping Running jobs whose
|
||||
// owning agent has been silent. Bundle C / Audit M-016. Defaults below.
|
||||
agentOfflineJobTTL time.Duration
|
||||
@@ -138,6 +147,7 @@ type Scheduler struct {
|
||||
cloudDiscoveryRunning atomic.Bool
|
||||
crlGenerationRunning atomic.Bool
|
||||
jobTimeoutRunning atomic.Bool
|
||||
acmeGCRunning atomic.Bool
|
||||
|
||||
// Graceful shutdown: wait for in-flight work to complete
|
||||
wg sync.WaitGroup
|
||||
@@ -174,6 +184,7 @@ func NewScheduler(
|
||||
cloudDiscoveryInterval: 6 * time.Hour,
|
||||
crlGenerationInterval: 1 * time.Hour,
|
||||
jobTimeoutInterval: 10 * time.Minute,
|
||||
acmeGCInterval: 1 * time.Minute,
|
||||
// 5 minutes is 5×agentHealthCheckInterval default of 1m; an agent
|
||||
// must miss multiple heartbeats before its in-flight jobs are reaped.
|
||||
agentOfflineJobTTL: 5 * time.Minute,
|
||||
@@ -287,6 +298,25 @@ func (s *Scheduler) SetJobReaperService(jr JobReaperService) {
|
||||
s.jobReaper = jr
|
||||
}
|
||||
|
||||
// SetACMEGarbageCollector wires the ACME GC service. Phase 5 — when
|
||||
// non-nil, an acmeGCLoop runs every acmeGCInterval and sweeps expired
|
||||
// nonces / authzs / orders. Optional: leaving nil disables the loop
|
||||
// (legacy behavior pre-Phase-5).
|
||||
func (s *Scheduler) SetACMEGarbageCollector(gc ACMEGarbageCollector) {
|
||||
s.acmeGC = gc
|
||||
}
|
||||
|
||||
// SetACMEGCInterval configures the interval at which the ACME GC sweep
|
||||
// runs. Default 1m. Operators with quiet fleets can lengthen to 5m;
|
||||
// operators expecting nonce-storms can shorten to 30s. Zero or
|
||||
// negative values are ignored.
|
||||
func (s *Scheduler) SetACMEGCInterval(d time.Duration) {
|
||||
if d <= 0 {
|
||||
return
|
||||
}
|
||||
s.acmeGCInterval = d
|
||||
}
|
||||
|
||||
// SetAgentOfflineJobTTL sets the threshold past which a Running job whose
|
||||
// owning agent has gone silent is reaped to Failed. Bundle C / Audit M-016.
|
||||
// Zero or negative values are ignored (the default of 5 minutes is kept).
|
||||
@@ -342,6 +372,9 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
|
||||
if s.crlCacheService != nil {
|
||||
loopCount++
|
||||
}
|
||||
if s.acmeGC != nil {
|
||||
loopCount++
|
||||
}
|
||||
s.wg.Add(loopCount)
|
||||
|
||||
go func() { defer s.wg.Done(); s.renewalCheckLoop(ctx) }()
|
||||
@@ -367,6 +400,9 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
|
||||
if s.crlCacheService != nil {
|
||||
go func() { defer s.wg.Done(); s.crlGenerationLoop(ctx) }()
|
||||
}
|
||||
if s.acmeGC != nil {
|
||||
go func() { defer s.wg.Done(); s.acmeGCLoop(ctx) }()
|
||||
}
|
||||
|
||||
// Signal that all loops are launched
|
||||
close(startedChan)
|
||||
@@ -1074,3 +1110,39 @@ func (s *Scheduler) runCRLGeneration(ctx context.Context) {
|
||||
|
||||
// ErrSchedulerShutdownTimeout is returned when scheduler graceful shutdown times out.
|
||||
var ErrSchedulerShutdownTimeout = errors.New("scheduler graceful shutdown timeout")
|
||||
|
||||
// acmeGCLoop runs every acmeGCInterval and invokes ACMEGarbageCollector.
|
||||
// Per CLAUDE.md "Scheduler idempotency" architecture decision: an
|
||||
// atomic.Bool guard prevents concurrent tick execution; the
|
||||
// sync.WaitGroup tracks the in-flight goroutine for graceful shutdown.
|
||||
// Phase 5.
|
||||
func (s *Scheduler) acmeGCLoop(ctx context.Context) {
|
||||
ticker := time.NewTicker(s.acmeGCInterval)
|
||||
defer ticker.Stop()
|
||||
|
||||
for {
|
||||
select {
|
||||
case <-ctx.Done():
|
||||
return
|
||||
case <-ticker.C:
|
||||
if !s.acmeGCRunning.CompareAndSwap(false, true) {
|
||||
s.logger.Warn("ACME GC sweep still running, skipping tick")
|
||||
continue
|
||||
}
|
||||
s.wg.Add(1)
|
||||
go func() {
|
||||
defer s.wg.Done()
|
||||
defer s.acmeGCRunning.Store(false)
|
||||
// 1-minute timeout per sweep — the per-statement work is
|
||||
// cheap (single DELETE / UPDATE per sweep, all on indexed
|
||||
// columns), but bound the cycle so a stuck Postgres can't
|
||||
// block the next tick.
|
||||
opCtx, cancel := context.WithTimeout(ctx, time.Minute)
|
||||
defer cancel()
|
||||
if err := s.acmeGC.GarbageCollect(opCtx); err != nil {
|
||||
s.logger.Warn("acme gc sweep failed (next tick will retry)", "error", err)
|
||||
}
|
||||
}()
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -56,6 +56,20 @@ type ACMERepo interface {
|
||||
// Phase 4 — key rollover + revocation auth.
|
||||
UpdateAccountJWKWithTx(ctx context.Context, q repository.Querier, accountID, expectedOldThumbprint, newThumbprint, newJWKPEM string) error
|
||||
AccountOwnsCertificate(ctx context.Context, accountID, certificateID string) (bool, error)
|
||||
// Phase 5 — per-account concurrent-order count + GC sweeps.
|
||||
// CountActiveOrdersByAccount returns the number of orders in
|
||||
// pending/ready/processing for the given account.
|
||||
CountActiveOrdersByAccount(ctx context.Context, accountID string) (int, error)
|
||||
// GCExpiredNonces deletes nonces whose expires_at < now() OR
|
||||
// used = true. Returns rows-affected count for telemetry.
|
||||
GCExpiredNonces(ctx context.Context) (int64, error)
|
||||
// GCExpireAuthorizations transitions authzs in `pending` whose
|
||||
// expires_at < now() to `expired`. Returns rows-affected count.
|
||||
GCExpireAuthorizations(ctx context.Context) (int64, error)
|
||||
// GCInvalidateExpiredOrders transitions orders in
|
||||
// pending/ready/processing whose expires_at < now() to `invalid`
|
||||
// with a server-internal error. Returns rows-affected count.
|
||||
GCInvalidateExpiredOrders(ctx context.Context) (int64, error)
|
||||
}
|
||||
|
||||
// CertificateRevoker is the minimum surface ACMEService needs to route
|
||||
@@ -142,6 +156,12 @@ type ACMEService struct {
|
||||
// and RenewalInfo returns the no-policy default window.
|
||||
revoker CertificateRevoker
|
||||
renewalPolicies RenewalPolicyLookup
|
||||
|
||||
// Phase 5 — per-account rate limiter. cmd/server/main.go constructs
|
||||
// an *acme.RateLimiter and wires it via SetRateLimiter. When unset
|
||||
// (tests, legacy bootstrap) the limiter calls short-circuit to
|
||||
// "always allow" — same shape as the validatorPool unset case.
|
||||
rateLimiter *acme.RateLimiter
|
||||
}
|
||||
|
||||
// NewACMEService constructs an ACMEService with the directory + nonce
|
||||
@@ -196,6 +216,16 @@ func (s *ACMEService) SetRevocationDelegate(r CertificateRevoker) { s.revoker =
|
||||
// still returns 200.
|
||||
func (s *ACMEService) SetRenewalPolicyLookup(r RenewalPolicyLookup) { s.renewalPolicies = r }
|
||||
|
||||
// SetRateLimiter wires Phase 5's per-account rate limiter. Optional —
|
||||
// when nil, the per-action rate-limit checks short-circuit to
|
||||
// "always allow" so the legacy code path stays unchanged for bootstrap
|
||||
// + tests that don't care about throttling.
|
||||
func (s *ACMEService) SetRateLimiter(r *acme.RateLimiter) { s.rateLimiter = r }
|
||||
|
||||
// RateLimiter returns the wired limiter so the handler can compute
|
||||
// Retry-After durations on rate-limited responses without re-checking.
|
||||
func (s *ACMEService) RateLimiter() *acme.RateLimiter { return s.rateLimiter }
|
||||
|
||||
// SetValidatorPool wires Phase 3's challenge validator pool.
|
||||
// cmd/server/main.go constructs an *acme.Pool at startup with the
|
||||
// per-type concurrency caps from cfg.ACMEServer. Optional —
|
||||
@@ -334,6 +364,19 @@ var ErrACMEARIDisabled = errors.New("acme: ARI is disabled on this server")
|
||||
// not RFC 9773 §4.1 shape. Handler maps to 400 + malformed.
|
||||
var ErrACMEARIBadCertID = errors.New("acme: ARI cert-id is malformed")
|
||||
|
||||
// Phase 5 sentinels.
|
||||
|
||||
// ErrACMERateLimited is returned when the per-action rate limit fires.
|
||||
// Handler maps to RFC 7807 + RFC 8555 §6.7
|
||||
// `urn:ietf:params:acme:error:rateLimited` with a Retry-After header.
|
||||
var ErrACMERateLimited = errors.New("acme: rate limit exceeded")
|
||||
|
||||
// ErrACMEConcurrentOrdersExceeded is returned by CreateOrder when the
|
||||
// account already has cfg.RateLimitConcurrentOrders orders in
|
||||
// pending/ready/processing. Handler maps to rateLimited (RFC 8555 §6.7
|
||||
// shape; the certctl-side cause is concurrency rather than per-hour).
|
||||
var ErrACMEConcurrentOrdersExceeded = errors.New("acme: concurrent orders limit exceeded")
|
||||
|
||||
// BuildDirectory constructs the per-profile directory document.
|
||||
//
|
||||
// profileID resolution:
|
||||
@@ -463,6 +506,13 @@ type ACMEMetrics struct {
|
||||
RevokeCertFailTotal atomic.Uint64 // rejected revocation (4xx)
|
||||
RenewalInfoTotal atomic.Uint64 // ARI 200
|
||||
RenewalInfoFailTotal atomic.Uint64 // ARI 4xx
|
||||
|
||||
// Phase 5 — GC sweep counts (per-tick rows-affected, summed).
|
||||
GCNoncesReapedTotal atomic.Uint64
|
||||
GCAuthzsExpiredTotal atomic.Uint64
|
||||
GCOrdersInvalidatedTotal atomic.Uint64
|
||||
GCRunsTotal atomic.Uint64
|
||||
GCRunFailuresTotal atomic.Uint64
|
||||
}
|
||||
|
||||
// NewACMEMetrics returns a zeroed counter table. Concurrent callers
|
||||
@@ -508,6 +558,11 @@ func (m *ACMEMetrics) Snapshot() map[string]uint64 {
|
||||
"certctl_acme_revoke_cert_failures_total": m.RevokeCertFailTotal.Load(),
|
||||
"certctl_acme_renewal_info_total": m.RenewalInfoTotal.Load(),
|
||||
"certctl_acme_renewal_info_failures_total": m.RenewalInfoFailTotal.Load(),
|
||||
"certctl_acme_gc_nonces_reaped_total": m.GCNoncesReapedTotal.Load(),
|
||||
"certctl_acme_gc_authzs_expired_total": m.GCAuthzsExpiredTotal.Load(),
|
||||
"certctl_acme_gc_orders_invalidated_total": m.GCOrdersInvalidatedTotal.Load(),
|
||||
"certctl_acme_gc_runs_total": m.GCRunsTotal.Load(),
|
||||
"certctl_acme_gc_run_failures_total": m.GCRunFailuresTotal.Load(),
|
||||
}
|
||||
}
|
||||
|
||||
@@ -783,6 +838,27 @@ func (s *ACMEService) CreateOrder(
|
||||
s.metrics.bump(&s.metrics.NewOrderFailureTotal)
|
||||
return nil, fmt.Errorf("acme: new-order requires SetTransactor + SetAuditService")
|
||||
}
|
||||
// Phase 5 — per-account orders/hour cap. Hits return rateLimited
|
||||
// (RFC 8555 §6.7) before any DB work. Counter is in-memory; restart
|
||||
// wipes (eventual-consistency caps are acceptable).
|
||||
if s.rateLimiter != nil && s.cfg.RateLimitOrdersPerHour > 0 {
|
||||
if !s.rateLimiter.Allow(acme.ActionNewOrder, accountID, s.cfg.RateLimitOrdersPerHour) {
|
||||
s.metrics.bump(&s.metrics.NewOrderFailureTotal)
|
||||
return nil, ErrACMERateLimited
|
||||
}
|
||||
}
|
||||
// Phase 5 — concurrent-orders cap. We count
|
||||
// pending/ready/processing orders for this account; if at-or-over
|
||||
// the cap, reject. This is a DB read (no FOR UPDATE), so two
|
||||
// requests racing under the threshold can both succeed and push
|
||||
// the account one over — accepted as eventual-consistency.
|
||||
if s.cfg.RateLimitConcurrentOrders > 0 {
|
||||
count, cerr := s.repo.CountActiveOrdersByAccount(ctx, accountID)
|
||||
if cerr == nil && count >= s.cfg.RateLimitConcurrentOrders {
|
||||
s.metrics.bump(&s.metrics.NewOrderFailureTotal)
|
||||
return nil, ErrACMEConcurrentOrdersExceeded
|
||||
}
|
||||
}
|
||||
resolvedProfileID, err := s.resolveProfile(ctx, profileID)
|
||||
if err != nil {
|
||||
s.metrics.bump(&s.metrics.NewOrderFailureTotal)
|
||||
@@ -1285,6 +1361,16 @@ func (s *ACMEService) RespondToChallenge(
|
||||
s.metrics.bump(&s.metrics.ChallengeRespondFailTotal)
|
||||
return nil, ErrACMEChallengePoolUnconfigured
|
||||
}
|
||||
// Phase 5 — per-challenge respond rate limit. Defends against retry
|
||||
// storms from a misbehaving client. Keyed by challengeID (not
|
||||
// accountID) so a flood against one challenge doesn't drain the
|
||||
// account's whole budget.
|
||||
if s.rateLimiter != nil && s.cfg.RateLimitChallengeRespondsPerHour > 0 {
|
||||
if !s.rateLimiter.Allow(acme.ActionChallengeRespond, challengeID, s.cfg.RateLimitChallengeRespondsPerHour) {
|
||||
s.metrics.bump(&s.metrics.ChallengeRespondFailTotal)
|
||||
return nil, ErrACMERateLimited
|
||||
}
|
||||
}
|
||||
|
||||
ch, err := s.repo.GetChallengeByID(ctx, challengeID)
|
||||
if err != nil {
|
||||
@@ -1508,6 +1594,14 @@ func (s *ACMEService) RotateAccountKey(
|
||||
s.metrics.bump(&s.metrics.KeyChangeFailTotal)
|
||||
return nil, ErrACMEKeyRolloverInvalid
|
||||
}
|
||||
// Phase 5 — rollovers/hour cap. Defaults to 5/hour: a flood is an
|
||||
// attack signal (key rotation should be rare). Keyed by accountID.
|
||||
if s.rateLimiter != nil && s.cfg.RateLimitKeyChangePerHour > 0 {
|
||||
if !s.rateLimiter.Allow(acme.ActionKeyChange, oldAccount.AccountID, s.cfg.RateLimitKeyChangePerHour) {
|
||||
s.metrics.bump(&s.metrics.KeyChangeFailTotal)
|
||||
return nil, ErrACMERateLimited
|
||||
}
|
||||
}
|
||||
|
||||
newThumbprint, err := acme.JWKThumbprint(newJWK)
|
||||
if err != nil {
|
||||
@@ -1816,3 +1910,56 @@ func mapACMERevocationReason(code int) string {
|
||||
return string(domain.RevocationReasonUnspecified)
|
||||
}
|
||||
}
|
||||
|
||||
// GarbageCollect runs a single ACME GC sweep. Phase 5 — the scheduler
|
||||
// invokes this every cfg.GCInterval. Three independent sweeps:
|
||||
//
|
||||
// 1. Delete used / expired nonces.
|
||||
// 2. Transition expired pending authzs to `expired`.
|
||||
// 3. Transition expired pending/ready/processing orders to `invalid`.
|
||||
//
|
||||
// Each sweep is a single SQL statement (no per-row transactions) so a
|
||||
// large reap is one atomic write per sweep. Per-sweep errors are
|
||||
// logged-and-continued: a failing nonces sweep doesn't block the
|
||||
// authzs sweep. Returns the first error encountered (for caller
|
||||
// telemetry); per-sweep counts are recorded on metrics regardless.
|
||||
//
|
||||
// Idempotent — repeated runs are safe; the second run finds 0 rows.
|
||||
func (s *ACMEService) GarbageCollect(ctx context.Context) error {
|
||||
s.metrics.bump(&s.metrics.GCRunsTotal)
|
||||
var firstErr error
|
||||
|
||||
if n, err := s.repo.GCExpiredNonces(ctx); err != nil {
|
||||
s.metrics.bump(&s.metrics.GCRunFailuresTotal)
|
||||
if firstErr == nil {
|
||||
firstErr = fmt.Errorf("acme gc: nonces: %w", err)
|
||||
}
|
||||
} else if n > 0 {
|
||||
atomicAddUint64(&s.metrics.GCNoncesReapedTotal, uint64(n))
|
||||
}
|
||||
|
||||
if n, err := s.repo.GCExpireAuthorizations(ctx); err != nil {
|
||||
s.metrics.bump(&s.metrics.GCRunFailuresTotal)
|
||||
if firstErr == nil {
|
||||
firstErr = fmt.Errorf("acme gc: authzs: %w", err)
|
||||
}
|
||||
} else if n > 0 {
|
||||
atomicAddUint64(&s.metrics.GCAuthzsExpiredTotal, uint64(n))
|
||||
}
|
||||
|
||||
if n, err := s.repo.GCInvalidateExpiredOrders(ctx); err != nil {
|
||||
s.metrics.bump(&s.metrics.GCRunFailuresTotal)
|
||||
if firstErr == nil {
|
||||
firstErr = fmt.Errorf("acme gc: orders: %w", err)
|
||||
}
|
||||
} else if n > 0 {
|
||||
atomicAddUint64(&s.metrics.GCOrdersInvalidatedTotal, uint64(n))
|
||||
}
|
||||
|
||||
return firstErr
|
||||
}
|
||||
|
||||
// atomicAddUint64 adds delta to the counter. The metrics struct exposes
|
||||
// only `bump` (add 1) by default; this helper covers the
|
||||
// rows-affected-N case the GC needs.
|
||||
func atomicAddUint64(c *atomic.Uint64, delta uint64) { c.Add(delta) }
|
||||
|
||||
@@ -164,6 +164,23 @@ func (f *fakeACMERepo) UpdateAccountJWKWithTx(ctx context.Context, q repository.
|
||||
func (f *fakeACMERepo) AccountOwnsCertificate(ctx context.Context, accountID, certificateID string) (bool, error) {
|
||||
return false, nil
|
||||
}
|
||||
func (f *fakeACMERepo) CountActiveOrdersByAccount(ctx context.Context, accountID string) (int, error) {
|
||||
return 0, nil
|
||||
}
|
||||
func (f *fakeACMERepo) GCExpiredNonces(ctx context.Context) (int64, error) {
|
||||
n := int64(0)
|
||||
for nonce, exp := range f.issued {
|
||||
if time.Now().After(exp) {
|
||||
delete(f.issued, nonce)
|
||||
n++
|
||||
}
|
||||
}
|
||||
return n, nil
|
||||
}
|
||||
func (f *fakeACMERepo) GCExpireAuthorizations(ctx context.Context) (int64, error) { return 0, nil }
|
||||
func (f *fakeACMERepo) GCInvalidateExpiredOrders(ctx context.Context) (int64, error) {
|
||||
return 0, nil
|
||||
}
|
||||
|
||||
// fakeTransactor is the repository.Transactor stand-in: runs fn
|
||||
// against the supplied querier (we just pass nil — fakes ignore it).
|
||||
|
||||
Reference in New Issue
Block a user