acme-server: cert-manager integration test + production hardening (Phase 5/7)

Closes the production-readiness loop on the ACME surface. After this
commit, certctl ships per-account rate limits + a GC sweeper for
expired ACME state + a kind-driven cert-manager 1.15 integration test
+ a lego-driven RFC conformance harness + a k6 loadtest scenario for
the unauthenticated ACME path.

Architecture:
  - Rate limits live in memory, per replica. A restart wipes the
    counters; that is acceptable because the orders/hour caps are
    best-effort rather than strict. A 3-replica certctl-server fleet
    behind an LB effectively allows up to 3x the configured rate per
    account; persistent rate limiting is a follow-up if production
    telemetry shows abuse patterns we can't catch within a single
    restart cycle. Buckets are isolated per key + per action:
    ActionNewOrder/acc-1, ActionKeyChange/acc-1, and
    ActionChallengeRespond/<challenge-id> are independent buckets.
  - GC loop follows the existing scheduler-loop pattern (atomic.Bool
    + sync.WaitGroup; see crlGenerationLoop for shape). Three
    independent SQL sweeps per tick: DELETE used or expired nonces;
    UPDATE pending authzs whose expires_at < now() to expired; UPDATE
    pending/ready/processing orders whose expires_at < now() to
    invalid. Each sweep is a single statement; a failed sweep is
    logged and the tick continues, so a failing nonces sweep doesn't
    block authzs. A per-sweep 1m timeout bounds a stuck Postgres.
  - cert-manager integration test is gated on KIND_AVAILABLE so CI
    skips it cleanly (kind is too heavy to run per PR). Operators run
    it locally via 'make acme-cert-manager-test'; the harness brings
    up a fresh cluster each run + tears it down on Cleanup.
  - lego conformance harness drives a real ACME client through
    register → run → cert-PEM-landed against a hermetic certctl
    stack. Catches RFC-shape regressions before they ship, i.e. the
    ones third-party clients would otherwise hit first.
  - k6 ACME-flow scenario hammers the unauthenticated surface
    (directory + new-nonce + ARI synthetic-id) at 100 VUs × 5m. JWS-
    signed flows are out of scope for k6 (no JWS support); they're
    covered by the lego harness above.
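
The per-key + per-action bucket isolation described in the rate-limit bullet above can be sketched as a small token-bucket limiter. This is a minimal illustration, not the code in internal/api/acme/ratelimit.go: the names sketchLimiter, newSketchLimiter, Allow, and fakeClock are hypothetical, and the refill policy (capacity = perHour, refilled continuously) is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// bucketKey isolates counters per (action, key) pair, so
// "new-order"/acc-1 and "key-change"/acc-1 never share tokens.
type bucketKey struct {
	action string
	key    string
}

type bucket struct {
	tokens float64
	last   time.Time
}

// fakeClock is a hypothetical test clock so refill can be exercised
// without sleeping.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time       { return c.t }
func (c *fakeClock) AdvanceSeconds(n int) { c.t = c.t.Add(time.Duration(n) * time.Second) }

// sketchLimiter is an in-memory, per-process limiter. State is lost on
// restart and is not shared between replicas.
type sketchLimiter struct {
	mu      sync.Mutex
	buckets map[bucketKey]*bucket
	now     func() time.Time
}

func newSketchLimiter(now func() time.Time) *sketchLimiter {
	return &sketchLimiter{buckets: make(map[bucketKey]*bucket), now: now}
}

// Allow consumes one token from the (action, key) bucket. A perHour
// value <= 0 disables the limit. On denial it returns the time until
// one full token is available again.
func (l *sketchLimiter) Allow(action, key string, perHour int) (bool, time.Duration) {
	if perHour <= 0 {
		return true, 0
	}
	l.mu.Lock()
	defer l.mu.Unlock()

	now := l.now()
	capacity := float64(perHour)
	rate := capacity / time.Hour.Seconds() // tokens per second

	k := bucketKey{action: action, key: key}
	b, ok := l.buckets[k]
	if !ok {
		b = &bucket{tokens: capacity, last: now}
		l.buckets[k] = b
	}
	// Refill for the time elapsed since the last call, capped at capacity.
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * rate
		if b.tokens > capacity {
			b.tokens = capacity
		}
		b.last = now
	}
	if b.tokens >= 1 {
		b.tokens--
		return true, 0
	}
	wait := time.Duration((1 - b.tokens) / rate * float64(time.Second))
	return false, wait
}

func main() {
	clock := &fakeClock{t: time.Unix(0, 0)}
	l := newSketchLimiter(clock.Now)
	for i := 1; i <= 3; i++ {
		ok, wait := l.Allow("new-order", "acc-1", 2)
		fmt.Printf("new-order/acc-1 call %d: allowed=%v retryAfter=%s\n", i, ok, wait.Round(time.Second))
	}
	ok, _ := l.Allow("key-change", "acc-1", 2)
	fmt.Println("key-change/acc-1 allowed:", ok)
}
```

Because every replica would hold its own map of buckets, a fleet of N replicas admits up to N times the configured rate, which is the trade-off the bullet above accepts.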

What ships:
  - internal/api/acme/ratelimit.go (+ ratelimit_test.go: 7 cases —
    disable-when-perHour-zero, capacity, per-key isolation, per-
    action isolation, refill-over-time, RetryAfter, concurrent-access
    with -race + 200 goroutines × 200 calls).
  - internal/repository/postgres/acme.go: 4 new methods —
    CountActiveOrdersByAccount + GCExpiredNonces + GCExpireAuthorizations
    + GCInvalidateExpiredOrders. Each a single SQL statement.
  - internal/service/acme.go: SetRateLimiter + GarbageCollect +
    rate-limit gates at 3 entry points (CreateOrder + RotateAccountKey
    + RespondToChallenge) + concurrent-orders gate at CreateOrder.
    2 new sentinels (ErrACMERateLimited, ErrACMEConcurrentOrdersExceeded);
    5 new GC metrics (gc_runs / gc_run_failures / gc_nonces_reaped /
    gc_authzs_expired / gc_orders_invalidated).
  - internal/scheduler/scheduler.go: ACMEGarbageCollector interface +
    acmeGCRunning atomic.Bool + acmeGCInterval + 2 setters
    (SetACMEGarbageCollector + SetACMEGCInterval) + acmeGCLoop
    following the crlGenerationLoop shape.
  - internal/api/handler/acme.go: writeServiceError gains rateLimited
    (429 + RFC 8555 §6.7) + concurrent-orders-exceeded mappings.
  - internal/config/config.go: 5 new env vars
    (CERTCTL_ACME_SERVER_RATE_LIMIT_ORDERS_PER_HOUR=100,
    CERTCTL_ACME_SERVER_RATE_LIMIT_CONCURRENT_ORDERS=5,
    CERTCTL_ACME_SERVER_RATE_LIMIT_KEY_CHANGE_PER_HOUR=5,
    CERTCTL_ACME_SERVER_RATE_LIMIT_CHALLENGE_RESPONDS_PER_HOUR=60,
    CERTCTL_ACME_SERVER_GC_INTERVAL=1m).
  - cmd/server/main.go: NewRateLimiter() + SetRateLimiter() at
    startup; conditional SetACMEGarbageCollector(acmeService) +
    SetACMEGCInterval(cfg.ACMEServer.GCInterval) when Enabled is set
    and GCInterval > 0.
  - deploy/test/acme-integration/: kind-config.yaml +
    cert-manager-install.sh +
    clusterissuer-trust-authenticated.yaml +
    clusterissuer-challenge.yaml + certificate-test.yaml +
    conformance-lego.sh + certmanager_test.go (//go:build integration
    + KIND_AVAILABLE gate).
  - deploy/test/loadtest/k6/acme_flow.js + README ACME-flows section.
  - Makefile: 2 new PHONY targets (acme-cert-manager-test +
    acme-rfc-conformance-test).
  - docs/acme-server.md: status flipped to Phase 5; Configuration
    table grows 5 rows; new 'Phase 5 — operational guidance' section
    explaining rate-limit math + GC sweeper semantics + cert-manager
    integration + lego conformance + k6 baseline.
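
For readers wiring up something similar, the log-and-continue sweep sequence with a per-sweep timeout might look like the sketch below. It does not reproduce GarbageCollect from internal/service/acme.go: the sweep type, runSweeps, runDemo, and the simulated sweep bodies are all hypothetical, and the real method presumably calls the three repository GC methods instead.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"
)

// sweep is a hypothetical named GC step that reports rows affected.
type sweep struct {
	name string
	run  func(ctx context.Context) (int64, error)
}

// runSweeps executes each sweep under its own timeout. A failing sweep
// is logged and counted, but never prevents later sweeps from running.
func runSweeps(parent context.Context, timeout time.Duration, sweeps []sweep) (map[string]int64, int) {
	affected := make(map[string]int64, len(sweeps))
	failures := 0
	for _, s := range sweeps {
		ctx, cancel := context.WithTimeout(parent, timeout)
		n, err := s.run(ctx)
		cancel() // release the timer before the next sweep starts
		if err != nil {
			failures++
			log.Printf("acme gc: %s sweep failed: %v", s.name, err)
			continue
		}
		affected[s.name] = n
	}
	return affected, failures
}

// runDemo wires three simulated sweeps; the first one fails on purpose.
func runDemo() (map[string]int64, int) {
	sweeps := []sweep{
		{name: "nonces", run: func(ctx context.Context) (int64, error) {
			return 0, errors.New("simulated database error")
		}},
		{name: "authzs", run: func(ctx context.Context) (int64, error) {
			return 3, nil
		}},
		{name: "orders", run: func(ctx context.Context) (int64, error) {
			// Honor the deadline the way a real query would.
			select {
			case <-ctx.Done():
				return 0, ctx.Err()
			default:
				return 2, nil
			}
		}},
	}
	return runSweeps(context.Background(), time.Minute, sweeps)
}

func main() {
	affected, failures := runDemo()
	fmt.Println("rows affected per sweep:", affected, "failures:", failures)
}
```

Giving each sweep its own context, rather than one deadline for the whole tick, means a slow first statement cannot use up the time budget of the statements after it.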

Tests:
  - 'go vet ./...' clean across the repo.
  - 'go test -short -count=1 ./internal/...' green across every
    affected package (service / acme / handler / scheduler / repo /
    config).
  - 'go vet -tags=integration ./deploy/test/acme-integration/' clean
    (the integration test compiles cleanly with the build tag).
  - The kind/cert-manager harness is gated behind KIND_AVAILABLE so
    CI skips by default; operators run locally via
    'make acme-cert-manager-test'.
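
An environment gate of the kind described above can be reduced to a small predicate. This is a sketch only: kindGateOpen is a hypothetical helper, and treating "0" and "false" as opt-out values is an assumption; the actual gate in certmanager_test.go may only check that the variable is non-empty.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// kindGateOpen reports whether the kind-backed test should run.
// Assumption: any non-empty value other than "0" or "false" opts in.
func kindGateOpen(getenv func(string) string) bool {
	v := strings.TrimSpace(getenv("KIND_AVAILABLE"))
	return v != "" && v != "0" && !strings.EqualFold(v, "false")
}

func main() {
	if !kindGateOpen(os.Getenv) {
		// Inside a Go test this branch would call t.Skip instead.
		fmt.Println("skip: KIND_AVAILABLE not set")
		return
	}
	fmt.Println("run: kind-backed cert-manager test enabled")
}
```

Passing the lookup function in, instead of calling os.Getenv directly, keeps the predicate testable without mutating the process environment.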

Engineering history: cowork/WORKSPACE-CHANGELOG.md 'ACME-Server-5'.
shankar0123
2026-05-03 19:42:03 +00:00
parent 9bfbac0f97
commit bee47f0318
20 changed files with 1341 additions and 21 deletions
@@ -751,6 +751,85 @@ func (r *ACMERepository) AccountOwnsCertificate(ctx context.Context, accountID,
	return count > 0, nil
}

// --- Phase 5 — concurrent-orders count + GC sweeps ---------------------

// CountActiveOrdersByAccount returns the number of acme_orders rows
// with the given account_id where status is in
// {pending, ready, processing}. Used by the per-account
// concurrent-orders rate limit.
func (r *ACMERepository) CountActiveOrdersByAccount(ctx context.Context, accountID string) (int, error) {
	var count int
	err := r.db.QueryRowContext(ctx, `
		SELECT COUNT(1)
		FROM acme_orders
		WHERE account_id = $1
		  AND status IN ('pending', 'ready', 'processing')
	`, accountID).Scan(&count)
	if err != nil {
		return 0, fmt.Errorf("acme: count active orders: %w", err)
	}
	return count, nil
}

// GCExpiredNonces deletes nonce rows that have been used or have
// passed their expires_at. Returns rows-affected count for telemetry.
// Phase 5 — called every GCInterval from the scheduler.
func (r *ACMERepository) GCExpiredNonces(ctx context.Context) (int64, error) {
	res, err := r.db.ExecContext(ctx, `
		DELETE FROM acme_nonces
		WHERE used = TRUE OR expires_at < NOW()
	`)
	if err != nil {
		return 0, fmt.Errorf("acme: gc expired nonces: %w", err)
	}
	n, err := res.RowsAffected()
	if err != nil {
		return 0, fmt.Errorf("acme: gc expired nonces rows affected: %w", err)
	}
	return n, nil
}

// GCExpireAuthorizations transitions authzs in `pending` whose
// expires_at < NOW() to `expired`. Authzs in valid/invalid are left
// alone (they're already terminal). Returns rows-affected count.
func (r *ACMERepository) GCExpireAuthorizations(ctx context.Context) (int64, error) {
	res, err := r.db.ExecContext(ctx, `
		UPDATE acme_authorizations
		SET status = 'expired', updated_at = NOW()
		WHERE status = 'pending' AND expires_at < NOW()
	`)
	if err != nil {
		return 0, fmt.Errorf("acme: gc expire authorizations: %w", err)
	}
	n, err := res.RowsAffected()
	if err != nil {
		return 0, fmt.Errorf("acme: gc expire authorizations rows affected: %w", err)
	}
	return n, nil
}

// GCInvalidateExpiredOrders transitions orders in
// pending/ready/processing whose expires_at < NOW() to `invalid` with
// a server-internal error. Orders in valid/invalid are terminal and
// untouched.
func (r *ACMERepository) GCInvalidateExpiredOrders(ctx context.Context) (int64, error) {
	const errBlob = `{"type":"urn:ietf:params:acme:error:serverInternal","detail":"order expired before issuance","status":500}`
	res, err := r.db.ExecContext(ctx, `
		UPDATE acme_orders
		SET status = 'invalid', error = $1::jsonb, updated_at = NOW()
		WHERE status IN ('pending', 'ready', 'processing')
		  AND expires_at < NOW()
	`, errBlob)
	if err != nil {
		return 0, fmt.Errorf("acme: gc invalidate expired orders: %w", err)
	}
	n, err := res.RowsAffected()
	if err != nil {
		return 0, fmt.Errorf("acme: gc invalidate expired orders rows affected: %w", err)
	}
	return n, nil
}

// scanACMEAccount is the shared shape for the SELECT-by-X account
// queries above. Returns sql.ErrNoRows-wrapped repository.ErrNotFound
// on miss; any other scan failure surfaces verbatim.