mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 16:31:33 +00:00
bee47f0318
Closes the production-readiness loop on the ACME surface. After this
commit, certctl ships per-account rate limits + a GC sweeper for
expired ACME state + a kind-driven cert-manager 1.15 integration test
+ a lego-driven RFC conformance harness + a k6 loadtest scenario for
the unauthenticated ACME path.
Architecture:
- Rate limits live in memory, per replica. A restart wipes the
counters; orders/hour caps are only eventually consistent anyway. A
3-replica certctl-server fleet behind an LB effectively has 3x
the configured throughput per account; persistent rate limiting
is a follow-up if production telemetry shows abuse patterns we
can't catch in a single restart cycle. Buckets are isolated per key
and per action: ActionNewOrder/acc-1, ActionKeyChange/acc-1, and
ActionChallengeRespond/<challenge-id> are independent buckets.
- GC loop follows the existing scheduler-loop pattern (atomic.Bool
+ sync.WaitGroup; see crlGenerationLoop for shape). Three
independent SQL sweeps per tick (DELETE expired nonces; UPDATE
pending authzs whose expires_at < now() to expired; UPDATE
pending/ready/processing orders whose expires_at < now() to
invalid). Each sweep is a single statement; failures are logged-
and-continued so a failing nonces sweep doesn't block authzs.
Per-sweep 1m timeout bounds a stuck Postgres.
- cert-manager integration test is gated on KIND_AVAILABLE so CI
skips it cleanly (kind is too heavy for per-PR). Operators run
locally via 'make acme-cert-manager-test'; the harness brings up
a fresh cluster each run + tears it down on Cleanup.
- lego conformance harness drives a real ACME client through
register → run → cert-PEM-landed against a hermetic certctl
stack. Catches RFC-shape regressions third-party clients would
hit before they ship.
- k6 ACME-flow scenario hammers the unauthenticated surface
(directory + new-nonce + ARI synthetic-id) at 100 VUs × 5m. JWS-
signed flows are out of scope for k6 (no JWS support); they're
covered by the lego harness above.
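The GC loop shape described above can be sketched roughly as follows. This is a minimal illustration, not certctl's scheduler code: the `gcLoop` and `sweep` types and the `runOnce`, `start`, `stop`, and `demo` functions are hypothetical names, and the sweeps are fake functions standing in for the three SQL statements. Only the atomic.Bool + sync.WaitGroup guard, the log-and-continue behaviour, and the per-sweep 1m timeout are taken from the description.

```go
package main

import (
	"context"
	"errors"
	"log"
	"sync"
	"sync/atomic"
	"time"
)

// sweep is a hypothetical stand-in for one single-statement SQL sweep.
type sweep struct {
	name string
	run  func(ctx context.Context) (rows int64, err error)
}

// gcLoop is an illustrative scheduler loop, not certctl's actual type.
type gcLoop struct {
	running atomic.Bool    // guards against overlapping ticks
	wg      sync.WaitGroup // lets stop() wait for the loop goroutine
	cancel  context.CancelFunc
	sweeps  []sweep
	timeout time.Duration // per-sweep bound (1m in the commit message)
}

// runOnce runs every sweep under its own timeout. A failed sweep is logged
// and skipped so later sweeps still run. It returns the number of failed
// sweeps, or -1 when a previous tick is still in flight.
func (g *gcLoop) runOnce(ctx context.Context) int {
	if !g.running.CompareAndSwap(false, true) {
		return -1
	}
	defer g.running.Store(false)

	failures := 0
	for _, s := range g.sweeps {
		sctx, cancel := context.WithTimeout(ctx, g.timeout)
		rows, err := s.run(sctx)
		cancel()
		if err != nil {
			failures++
			log.Printf("acme gc: %s sweep failed: %v", s.name, err)
			continue
		}
		log.Printf("acme gc: %s sweep touched %d rows", s.name, rows)
	}
	return failures
}

// start launches the ticker goroutine.
func (g *gcLoop) start(interval time.Duration) {
	ctx, cancel := context.WithCancel(context.Background())
	g.cancel = cancel
	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		t := time.NewTicker(interval)
		defer t.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-t.C:
				g.runOnce(ctx)
			}
		}
	}()
}

// stop cancels the loop started by start and waits for it to exit.
func (g *gcLoop) stop() {
	g.cancel()
	g.wg.Wait()
}

// demo wires three fake sweeps where the first one fails.
func demo() (failures, succeeded int) {
	ok := func(context.Context) (int64, error) { succeeded++; return 1, nil }
	bad := func(context.Context) (int64, error) { return 0, errors.New("connection refused") }
	g := &gcLoop{timeout: time.Minute, sweeps: []sweep{
		{name: "nonces", run: bad},
		{name: "authzs", run: ok},
		{name: "orders", run: ok},
	}}
	failures = g.runOnce(context.Background())
	return failures, succeeded
}

func main() {
	failures, succeeded := demo()
	log.Printf("failed sweeps: %d, successful sweeps: %d", failures, succeeded)
}
```

In the demo the nonces sweep fails, yet the authzs and orders sweeps still run, which is the isolation property the bullet describes.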
What ships:
- internal/api/acme/ratelimit.go (+ ratelimit_test.go: 7 cases —
disable-when-perHour-zero, capacity, per-key isolation, per-
action isolation, refill-over-time, RetryAfter, concurrent-access
with -race + 200 goroutines × 200 calls).
- internal/repository/postgres/acme.go: 4 new methods —
CountActiveOrdersByAccount + GCExpiredNonces + GCExpireAuthorizations
+ GCInvalidateExpiredOrders. Each a single SQL statement.
- internal/service/acme.go: SetRateLimiter + GarbageCollect +
rate-limit gates at 3 entry points (CreateOrder + RotateAccountKey
+ RespondToChallenge) + concurrent-orders gate at CreateOrder.
2 new sentinels (ErrACMERateLimited, ErrACMEConcurrentOrdersExceeded);
5 new GC metrics (gc_runs / gc_run_failures / gc_nonces_reaped /
gc_authzs_expired / gc_orders_invalidated).
- internal/scheduler/scheduler.go: ACMEGarbageCollector interface +
acmeGCRunning atomic.Bool + acmeGCInterval + 2 setters
(SetACMEGarbageCollector + SetACMEGCInterval) + acmeGCLoop following
the crlGenerationLoop shape.
- internal/api/handler/acme.go: writeServiceError gains rateLimited
(429 + RFC 8555 §6.7) + concurrent-orders-exceeded mappings.
- internal/config/config.go: 5 new env vars
(CERTCTL_ACME_SERVER_RATE_LIMIT_ORDERS_PER_HOUR=100,
CERTCTL_ACME_SERVER_RATE_LIMIT_CONCURRENT_ORDERS=5,
CERTCTL_ACME_SERVER_RATE_LIMIT_KEY_CHANGE_PER_HOUR=5,
CERTCTL_ACME_SERVER_RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR=60,
CERTCTL_ACME_SERVER_GC_INTERVAL=1m).
- cmd/server/main.go: NewRateLimiter() + SetRateLimiter() at
startup; conditional SetACMEGarbageCollector(acmeService) +
SetACMEGCInterval(cfg.ACMEServer.GCInterval) when Enabled and
GCInterval > 0.
- deploy/test/acme-integration/: kind-config.yaml +
cert-manager-install.sh + clusterissuer-trust-authenticated.yaml +
clusterissuer-challenge.yaml + certificate-test.yaml +
conformance-lego.sh + certmanager_test.go (//go:build integration +
KIND_AVAILABLE gate).
- deploy/test/loadtest/k6/acme_flow.js + README ACME-flows section.
- Makefile: 2 new PHONY targets (acme-cert-manager-test +
acme-rfc-conformance-test).
- docs/acme-server.md: status flipped to Phase 5; Configuration
table grows 5 rows; new 'Phase 5 — operational guidance' section
explaining rate-limit math + GC sweeper semantics + cert-manager
integration + lego conformance + k6 baseline.
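As a rough illustration of the rateLimited mapping listed above, the sketch below turns a rate-limit sentinel into a 429 problem document with a Retry-After header. It is not the actual writeServiceError: `writeRateLimitProblem`, `errRateLimited`, the `problem` struct, and the detail text are hypothetical, and rounding the wait up to whole seconds is an assumption. Only the 429 status and the `urn:ietf:params:acme:error:rateLimited` type come from the commit and RFC 8555 §6.7.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"math"
	"net/http"
	"net/http/httptest"
	"strconv"
	"time"
)

// errRateLimited is a hypothetical local stand-in for the service sentinel.
var errRateLimited = errors.New("acme: rate limit exceeded")

// problem is a minimal RFC 7807 problem document.
type problem struct {
	Type   string `json:"type"`
	Detail string `json:"detail"`
	Status int    `json:"status"`
}

// writeRateLimitProblem maps the sentinel to 429 + Retry-After and falls
// back to a plain 500 for any other error.
func writeRateLimitProblem(w http.ResponseWriter, err error, retryAfter time.Duration) {
	if !errors.Is(err, errRateLimited) {
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}
	// Assumption: round up so clients never retry before a token exists.
	secs := int(math.Ceil(retryAfter.Seconds()))
	if secs < 1 {
		secs = 1
	}
	w.Header().Set("Retry-After", strconv.Itoa(secs))
	w.Header().Set("Content-Type", "application/problem+json")
	w.WriteHeader(http.StatusTooManyRequests)
	_ = json.NewEncoder(w).Encode(problem{
		Type:   "urn:ietf:params:acme:error:rateLimited",
		Detail: "too many requests for this account; retry later",
		Status: http.StatusTooManyRequests,
	})
}

// demo exercises the mapping with a wrapped sentinel and a 36s wait.
func demo() (code int, retryAfter, contentType string) {
	rec := httptest.NewRecorder()
	wrapped := fmt.Errorf("create order: %w", errRateLimited)
	writeRateLimitProblem(rec, wrapped, 36*time.Second)
	return rec.Code, rec.Header().Get("Retry-After"), rec.Header().Get("Content-Type")
}

func main() {
	code, retryAfter, contentType := demo()
	fmt.Println(code, retryAfter, contentType)
}
```

Using errors.Is rather than direct comparison means the mapping still fires when the service layer wraps the sentinel with extra context.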
Tests:
- 'go vet ./...' clean across the repo.
- 'go test -short -count=1 ./internal/...' green across every
affected package (service / acme / handler / scheduler / repo /
config).
- 'go vet -tags=integration ./deploy/test/acme-integration/' clean
(the integration test compiles cleanly with the build tag).
- The kind/cert-manager harness is gated behind KIND_AVAILABLE so
CI skips by default; operators run locally via
'make acme-cert-manager-test'.
Engineering history: cowork/WORKSPACE-CHANGELOG.md 'ACME-Server-5'.
167 lines
5.1 KiB
Go
// Copyright (c) certctl
// SPDX-License-Identifier: BSL-1.1

package acme

import (
	"errors"
	"sync"
	"time"
)

// Phase 5: per-account rolling-hour rate limiter for ACME operations.
//
// Architecture:
//   - In-memory token-bucket per (key, action). Restart wipes the
//     buckets; orders/hour caps are eventual-consistency so this is
//     acceptable. Persistent rate limiting is a follow-up if production
//     telemetry shows abuse patterns we can't catch in a single restart
//     cycle (master prompt criterion #11 explicitly accepts this).
//   - Tokens-per-hour math: bucket capacity = perHour, refill rate =
//     perHour / 3600 tokens/sec. A fresh bucket starts full; an
//     over-limit caller drains it then has to wait for replenishment.
//   - Key shape is action-specific: orders use accountID; key-rollover
//     uses accountID; challenge-respond uses challengeID (so a flood
//     against one challenge doesn't burn the whole account's budget).
//
// Concurrency: the outer map is RWMutex-guarded for create-on-demand;
// per-bucket allow() takes a tiny per-bucket Mutex. Mirrors the
// existing internal/api/middleware/middleware.go::keyedRateLimiter
// pattern (different scope, same shape).

// RateLimiter is the per-action token-bucket pool. Construct with
// NewRateLimiter(); pass a single instance into ACMEService via
// SetRateLimiter so all entry points share the same buckets.
type RateLimiter struct {
	mu      sync.RWMutex
	buckets map[string]*rlBucket // keyed by "<action>|<keyID>"
	clock   func() time.Time     // injectable for tests
}

// NewRateLimiter returns an empty RateLimiter. Buckets are created on
// first reference, so a fresh limiter does no work until traffic
// arrives.
func NewRateLimiter() *RateLimiter {
	return &RateLimiter{
		buckets: make(map[string]*rlBucket),
		clock:   time.Now,
	}
}

// SetClock replaces the clock for tests. Production callers leave it
// pointing at time.Now (the constructor default).
func (r *RateLimiter) SetClock(now func() time.Time) {
	if now != nil {
		r.clock = now
	}
}

// Allow reports whether the (action, keyID) bucket has at least one
// token available and, if so, consumes that token. perHour <= 0
// disables the limit (always true).
//
// A fresh bucket starts full, so the first call consumes the first
// token and returns true. Once the bucket is drained, further calls
// return false until elapsed time has refilled at least one token.
// perHour is read only when the bucket is first created.
func (r *RateLimiter) Allow(action, keyID string, perHour int) bool {
	if perHour <= 0 {
		return true
	}
	bucketKey := action + "|" + keyID
	r.mu.RLock()
	b, ok := r.buckets[bucketKey]
	r.mu.RUnlock()
	if !ok {
		// Re-check under the write lock: another goroutine may have
		// created the bucket between RUnlock and Lock.
		r.mu.Lock()
		b, ok = r.buckets[bucketKey]
		if !ok {
			b = &rlBucket{
				capacity:   float64(perHour),
				refillRate: float64(perHour) / 3600.0, // tokens/sec
				tokens:     float64(perHour),
				lastRefill: r.clock(),
			}
			r.buckets[bucketKey] = b
		}
		r.mu.Unlock()
	}
	return b.allow(r.clock)
}

// RetryAfter returns the duration the caller should wait before the
// (action, keyID) bucket has at least one token again. Returns 0 when
// at least one token is currently available. Used by the handler to
// emit a Retry-After header on rateLimited responses.
func (r *RateLimiter) RetryAfter(action, keyID string, perHour int) time.Duration {
	if perHour <= 0 {
		return 0
	}
	bucketKey := action + "|" + keyID
	r.mu.RLock()
	b, ok := r.buckets[bucketKey]
	r.mu.RUnlock()
	if !ok {
		return 0
	}
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens >= 1 {
		return 0
	}
	missing := 1 - b.tokens
	if b.refillRate <= 0 {
		// Shouldn't happen (Allow rejects perHour<=0 before bucket
		// creation), but dividing by a zero rate would yield +Inf and
		// an out-of-range Duration conversion.
		return time.Hour
	}
	secs := missing / b.refillRate
	return time.Duration(secs * float64(time.Second))
}

// rlBucket is the per-(action, keyID) token bucket. Mirrors the shape
// of internal/api/middleware/middleware.go::tokenBucket but with a
// per-hour-shaped refill instead of per-second.
type rlBucket struct {
	mu         sync.Mutex
	capacity   float64
	refillRate float64 // tokens per second
	tokens     float64
	lastRefill time.Time
}

func (b *rlBucket) allow(clock func() time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()

	now := clock()
	// Monotonic-clock-safe via t.Sub(t) per Go time-package contract.
	elapsed := now.Sub(b.lastRefill).Seconds()
	if elapsed > 0 {
		b.tokens += elapsed * b.refillRate
		if b.tokens > b.capacity {
			b.tokens = b.capacity
		}
		b.lastRefill = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// Action constants: keep one source of truth for the bucket-key
// `<action>|...` prefix. Using untyped consts (not iota) so they
// survive cross-process coordination if a follow-up adds shared-state
// rate-limiting.
const (
	ActionNewOrder         = "new_order"
	ActionKeyChange        = "key_change"
	ActionChallengeRespond = "challenge_respond"
)

// ErrRateLimited is the sentinel service-layer entry points return on
// a hit. Handler maps to RFC 7807 + RFC 8555 §6.7
// `urn:ietf:params:acme:error:rateLimited` with Retry-After.
var ErrRateLimited = errors.New("acme: rate limit exceeded")