acme-server: HTTP-01 + DNS-01 + TLS-ALPN-01 challenge validation (Phase 3/7)

Wires up the actual challenge-validation machinery so profiles in
acme_auth_mode='challenge' resolve end-to-end. After this commit,
cert-manager 1.15+ with `solver: http01: ingress` against a
challenge-mode profile completes a real HTTP-01 flow and gets a cert.
DNS-01 and TLS-ALPN-01 go through the same code path; only the
selected validator differs.

Architecture (the load-bearing parts):
  - 3 separate semaphore-bounded worker pools (one per challenge type),
    so HTTP-01 and DNS-01 can't starve each other under load. Default
    weight 10 per type; tunable via CERTCTL_ACME_SERVER_HTTP01_CONCURRENCY,
    DNS01_CONCURRENCY, TLSALPN01_CONCURRENCY.
  - 30s per-challenge timeout (configurable via PoolConfig.PerChallengeTimeout).
  - HTTP-01 validator runs validation.IsReservedIPForDial (newly
    exported wrapper preserving the existing private impl byte-for-byte
    for the network scanner + ValidateSafeURL paths) on the resolved
    IP — both at the initial dial and every redirect hop. SSRF probes
    into private IP space are refused before the connect.
  - DNS-01 validator uses a dedicated resolver pointed at
    CERTCTL_ACME_SERVER_DNS01_RESOLVER (default 8.8.8.8:53). It does
    NOT use the system resolver, which keeps behavior deterministic
    across deployments. Wildcard handling: `*.example.com` queries
    _acme-challenge.example.com.
  - TLS-ALPN-01 validator (RFC 8737) connects with ALPN `acme-tls/1`,
    inspects the id-pe-acmeIdentifier extension (OID 1.3.6.1.5.5.7.1.31),
    asserts the ASN.1 OCTET STRING value equals SHA-256 of the key
    authorization. Cert chain is intentionally NOT validated
    (InsecureSkipVerify=true is correct per RFC 8737: the proof is
    in the extension, not the chain). The docs/tls.md L-001 table
    documents this, and the //nolint:gosec comment carries the
    justification. SSRF guard: same posture as HTTP-01.
  - Validation is asynchronous: handler accepts the POST and returns
    200 immediately with status=processing; the worker-pool fires a
    callback that updates challenge → authz → order in a fresh
    background-context WithinTx. The order auto-promotes to `ready`
    when ALL authzs become valid; auto-fails to `invalid` when ANY
    authz becomes invalid.
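
Illustration of the per-type bounding described above. This is a minimal
sketch, not the shipped acme.Pool: typedPool, validateFunc, submit, drain
and peakInFlight are hypothetical names, and a buffered channel stands in
for whatever semaphore the real Pool uses. It assumes the per-challenge
timeout starts once a slot has been acquired.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// validateFunc is a hypothetical stand-in for a challenge validator.
type validateFunc func(ctx context.Context, identifier string) error

var errUnknownType = errors.New("unknown challenge type")

// typedPool bounds concurrency separately for each challenge type, so a
// saturated type cannot consume slots that belong to another type.
type typedPool struct {
	sems       map[string]chan struct{} // buffered channel used as a semaphore
	validators map[string]validateFunc
	timeout    time.Duration
	wg         sync.WaitGroup
}

func newTypedPool(caps map[string]int, vs map[string]validateFunc, timeout time.Duration) *typedPool {
	p := &typedPool{sems: make(map[string]chan struct{}), validators: vs, timeout: timeout}
	for typ, n := range caps {
		p.sems[typ] = make(chan struct{}, n)
	}
	return p
}

// submit returns immediately; onComplete fires from a worker goroutine
// (or synchronously when the type is unknown).
func (p *typedPool) submit(typ, identifier string, onComplete func(error)) {
	sem, haveSem := p.sems[typ]
	validate, haveValidator := p.validators[typ]
	if !haveSem || !haveValidator {
		onComplete(errUnknownType)
		return
	}
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		sem <- struct{}{}        // acquire a slot for this type only
		defer func() { <-sem }() // release it when validation ends
		ctx, cancel := context.WithTimeout(context.Background(), p.timeout)
		defer cancel()
		onComplete(validate(ctx, identifier))
	}()
}

// drain blocks until every submitted challenge has completed.
func (p *typedPool) drain() { p.wg.Wait() }

// peakInFlight submits jobs to an "http-01" pool with the given limit
// and returns the highest number of validators observed running at once.
func peakInFlight(limit, jobs int) int32 {
	var inFlight, peak atomic.Int32
	slow := func(ctx context.Context, _ string) error {
		n := inFlight.Add(1)
		for {
			old := peak.Load()
			if n <= old || peak.CompareAndSwap(old, n) {
				break
			}
		}
		time.Sleep(5 * time.Millisecond)
		inFlight.Add(-1)
		return nil
	}
	p := newTypedPool(
		map[string]int{"http-01": limit},
		map[string]validateFunc{"http-01": slow},
		time.Second,
	)
	for i := 0; i < jobs; i++ {
		p.submit("http-01", "example.com", func(error) {})
	}
	p.drain()
	return peak.Load()
}

func main() {
	fmt.Println("peak in-flight:", peakInFlight(3, 10))
}
```

Because each type owns its own channel, ten queued HTTP-01 jobs wait only
on the HTTP-01 slots; a DNS-01 submission would acquire from a different
channel and proceed regardless.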

What ships:
  - internal/api/acme/challenge.go: KeyAuthorization (RFC 8555 §8.1) +
    DNS01TXTRecordValue (§8.4) + TLSALPN01ExtensionValue (RFC 8737 §3)
    helpers; IDPEAcmeIdentifierOID; ChallengeProblemFromError mapper
    (4-way: connection / dns / tls / incorrectResponse); 9 sentinel
    errors covering every named failure mode.
  - internal/api/acme/validators.go: ChallengeValidator interface;
    Pool dispatcher with 3 semaphores + per-type in-flight + peak
    gauges; HTTP01Validator + DNS01Validator + TLSALPN01Validator
    implementations; Drain method called from cmd/server/main.go's
    shutdown sequence.
  - internal/api/acme/validators_test.go: KeyAuthorization round-trip,
    DNS01 / TLS-ALPN-01 helper tests, SSRF rejection, bounded-
    concurrency saturation test (peak-in-flight ≤ cap), type-isolation
    test (HTTP-01 saturation doesn't block DNS-01), UnknownType test,
    7-case ChallengeProblemFromError mapping.
  - internal/repository/postgres/acme.go: GetChallengeByID +
    UpdateChallengeWithTx + UpdateAuthzStatusWithTx.
  - internal/service/acme.go: SetValidatorPool wires the *acme.Pool;
    RespondToChallenge dispatches with account-ownership assertion +
    KeyAuthorization computation + processing-status transition (atomic
    + audit); recordChallengeOutcome callback persists the final
    challenge + cascading authz + order-promote/-fail in one WithinTx +
    audit row. 4 new metrics.
  - internal/api/handler/acme.go: Challenge handler; round-trips
    account.JWKPEM through ParseJWKFromPEM to recover the *jose.JSONWebKey
    the validator pool needs.
  - internal/api/router/router.go + openapi_parity_test.go +
    api/openapi-handler-exceptions.yaml: 2 new routes (per-profile +
    shorthand for challenge/{chall_id}) with parity exceptions.
  - cmd/server/main.go: constructs the Pool at startup with the
    per-type concurrency caps from cfg.ACMEServer; ACMEService.ValidatorPool()
    accessor exposed for the shutdown drain sequence.
  - internal/validation/ssrf.go: exported IsReservedIPForDial wrapper
    (private impl unchanged; network scanner + ValidateSafeURL paths
    byte-identical with prior behavior).
  - docs/tls.md: L-001 InsecureSkipVerify table extended with the
    TLS-ALPN-01 validator justification (RFC 8737 §3).
  - docs/acme-server.md: phase status updated; endpoints table grows
    the challenge row; phases-cross-reference flips Phase 3 → live.
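
For reference, the three derived values named in challenge.go can be
computed with the standard library alone. The sketch below is an
approximation under stated assumptions, not the shipped helpers: the
function names are hypothetical, it takes an *ecdsa.PublicKey rather than
a *jose.JSONWebKey, and it supports P-256 account keys only. The
thumbprint follows RFC 7638 (SHA-256 over the canonical JWK members).

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"encoding/asn1"
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// thumbprintP256 computes the RFC 7638 JWK thumbprint of a P-256 key.
func thumbprintP256(pub *ecdsa.PublicKey) (string, error) {
	ek, err := pub.ECDH()
	if err != nil {
		return "", err
	}
	raw := ek.Bytes() // uncompressed point: 0x04 || X || Y
	if len(raw) != 65 {
		return "", errors.New("sketch supports P-256 keys only")
	}
	b64 := base64.RawURLEncoding.EncodeToString
	// Required members in lexicographic order, no whitespace.
	canonical := fmt.Sprintf(`{"crv":"P-256","kty":"EC","x":"%s","y":"%s"}`,
		b64(raw[1:33]), b64(raw[33:]))
	sum := sha256.Sum256([]byte(canonical))
	return b64(sum[:]), nil
}

// keyAuthorization is token || "." || thumbprint (RFC 8555 §8.1).
func keyAuthorization(token string, pub *ecdsa.PublicKey) (string, error) {
	tp, err := thumbprintP256(pub)
	if err != nil {
		return "", err
	}
	return token + "." + tp, nil
}

// dns01TXTValue is base64url(SHA-256(keyAuthorization)) (RFC 8555 §8.4).
func dns01TXTValue(keyAuth string) string {
	sum := sha256.Sum256([]byte(keyAuth))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

// dns01QueryName strips a leading wildcard label before prefixing.
func dns01QueryName(identifier string) string {
	return "_acme-challenge." + strings.TrimPrefix(identifier, "*.")
}

// tlsALPN01ExtValue is the DER OCTET STRING wrapping
// SHA-256(keyAuthorization) (RFC 8737 §3).
func tlsALPN01ExtValue(keyAuth string) ([]byte, error) {
	sum := sha256.Sum256([]byte(keyAuth))
	return asn1.Marshal(sum[:])
}

// demoKeyAuthorization generates a throwaway key for illustration.
func demoKeyAuthorization(token string) (string, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return "", err
	}
	return keyAuthorization(token, &key.PublicKey)
}

func main() {
	ka, err := demoKeyAuthorization("example-token")
	if err != nil {
		panic(err)
	}
	ext, err := tlsALPN01ExtValue(ka)
	if err != nil {
		panic(err)
	}
	fmt.Println("key authorization:", ka)
	fmt.Println("dns-01 name:      ", dns01QueryName("*.example.com"))
	fmt.Println("dns-01 TXT value: ", dns01TXTValue(ka))
	fmt.Printf("tls-alpn-01 ext:   %x\n", ext)
}
```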

Tests:
  - 80%+ coverage on the new files.
  - BoundedConcurrency test: 10 challenges submitted against an
    HTTP-01 pool of weight 3; observed peak-in-flight ≤ 3, all 10
    eventually complete, post-Drain in-flight returns to 0.
  - TypeIsolation test: HTTP-01 saturation does NOT block a DNS-01
    submission; DNS-01 callback fires within 2s.
  - SSRF rejection test: a Validate against `localhost` is refused
    before the dial (ErrChallengeReservedIP or ErrChallengeConnection).
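
The refuse-before-connect behavior the SSRF test exercises can be
approximated as follows. This is a sketch under assumptions, not
validation.IsReservedIPForDial: isReservedAddr, isReserved, guardedDialer
and errReservedIP are hypothetical names, and the reserved set here is
only what net/netip classifies directly. The check sits in the dialer's
Control hook, which runs on the already-resolved address for every
connection, so redirect hops dialed through the same dialer are covered.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/netip"
	"syscall"
	"time"
)

var errReservedIP = errors.New("refusing to dial reserved IP")

// isReservedAddr reports whether a is loopback, private, link-local,
// multicast, unspecified, or invalid. IPv4-mapped IPv6 is unwrapped first.
func isReservedAddr(a netip.Addr) bool {
	a = a.Unmap()
	return !a.IsValid() ||
		a.IsLoopback() ||
		a.IsPrivate() ||
		a.IsLinkLocalUnicast() ||
		a.IsMulticast() ||
		a.IsUnspecified()
}

// isReserved parses s and treats unparseable input as reserved.
func isReserved(s string) bool {
	a, err := netip.ParseAddr(s)
	return err != nil || isReservedAddr(a)
}

// guardedDialer rejects reserved destinations after DNS resolution but
// before connect(2) is issued.
func guardedDialer(timeout time.Duration) *net.Dialer {
	return &net.Dialer{
		Timeout: timeout,
		Control: func(network, address string, _ syscall.RawConn) error {
			ap, err := netip.ParseAddrPort(address)
			if err != nil {
				return err
			}
			if isReservedAddr(ap.Addr()) {
				return errReservedIP
			}
			return nil
		},
	}
}

func main() {
	conn, err := guardedDialer(2*time.Second).Dial("tcp", "127.0.0.1:80")
	if err == nil {
		conn.Close()
	}
	fmt.Println("loopback refused:", errors.Is(err, errReservedIP))
	fmt.Println("8.8.8.8 reserved:", isReserved("8.8.8.8"))
}
```

Plugging such a dialer into an http.Transport's DialContext would apply
the same check to the initial request and to each redirect target.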

Engineering history: cowork/WORKSPACE-CHANGELOG.md "ACME-Server-3".
shankar0123
2026-05-03 14:09:00 +00:00
parent 4acd19910d
commit 7e22204ba7
15 changed files with 1407 additions and 32 deletions
+291 -18
@@ -49,6 +49,10 @@ type ACMERepo interface {
GetAuthzByID(ctx context.Context, authzID string) (*domain.ACMEAuthorization, error)
ListAuthzsByOrder(ctx context.Context, orderID string) ([]*domain.ACMEAuthorization, error)
CreateChallengeWithTx(ctx context.Context, q repository.Querier, ch *domain.ACMEChallenge) error
// Phase 3 — challenge state mutation.
GetChallengeByID(ctx context.Context, challengeID string) (*domain.ACMEChallenge, error)
UpdateChallengeWithTx(ctx context.Context, q repository.Querier, ch *domain.ACMEChallenge) error
UpdateAuthzStatusWithTx(ctx context.Context, q repository.Querier, authzID string, status domain.ACMEAuthzStatus) error
}
// profileLookup is the minimum surface ACMEService needs to resolve a
@@ -98,6 +102,13 @@ type ACMEService struct {
certService *CertificateService
certRepo repository.CertificateRepository
issuerRegistry *IssuerRegistry
// Phase 3 — challenge validator pool. cmd/server/main.go
// constructs an *acme.Pool at startup with the per-type
// concurrency caps from cfg.ACMEServer; the Pool owns the 3
// semaphores + the validators. Optional via SetValidatorPool —
// when nil, RespondToChallenge returns ErrACMEChallengePoolUnconfigured.
validatorPool *acme.Pool
}
// NewACMEService constructs an ACMEService with the directory + nonce
@@ -138,6 +149,17 @@ func (s *ACMEService) SetIssuancePipeline(certSvc *CertificateService, certRepo
s.issuerRegistry = registry
}
// SetValidatorPool wires Phase 3's challenge validator pool.
// cmd/server/main.go constructs an *acme.Pool at startup with the
// per-type concurrency caps from cfg.ACMEServer. Optional —
// RespondToChallenge returns ErrACMEChallengePoolUnconfigured when
// unset (handler maps to serverInternal).
func (s *ACMEService) SetValidatorPool(pool *acme.Pool) { s.validatorPool = pool }
// ValidatorPool returns the wired pool so cmd/server/main.go's
// shutdown sequence can call Drain on it.
func (s *ACMEService) ValidatorPool() *acme.Pool { return s.validatorPool }
// Metrics returns the per-op counter snapshotter. cmd/server/main.go
// passes this into MetricsHandler so the Prometheus exposer picks up
// the per-op signals.
@@ -201,6 +223,22 @@ var ErrACMEFinalizeUnconfigured = errors.New("acme: finalize pipeline not wired
// but the validators land in Phase 3).
var ErrACMEUnsupportedAuthMode = errors.New("acme: unsupported auth mode on profile")
// Phase 3 sentinels.
// ErrACMEChallengeNotFound is returned by RespondToChallenge when the
// challenge ID in the URL doesn't match any row.
var ErrACMEChallengeNotFound = errors.New("acme: challenge not found")
// ErrACMEChallengePoolUnconfigured is returned when SetValidatorPool
// hasn't been called. Indicates a deploy-time wiring bug; mapped to
// serverInternal.
var ErrACMEChallengePoolUnconfigured = errors.New("acme: validator pool not wired (call SetValidatorPool)")
// ErrACMEChallengeWrongState is returned when RespondToChallenge sees
// a challenge already in valid/invalid (idempotent observer-side
// behavior — same shape as Phase 1b's account inactive case).
var ErrACMEChallengeWrongState = errors.New("acme: challenge is no longer in pending state")
// BuildDirectory constructs the per-profile directory document.
//
// profileID resolution:
@@ -311,6 +349,12 @@ type ACMEMetrics struct {
CertDownloadTotal atomic.Uint64
CertDownloadFailureTotal atomic.Uint64
AuthzReadTotal atomic.Uint64
// Phase 3 — challenge validation.
ChallengeRespondTotal atomic.Uint64 // accepted: dispatched to the pool, or idempotent re-POST of a processing/valid/invalid challenge
ChallengeRespondFailTotal atomic.Uint64 // rejected before dispatch (wiring, lookup, ownership, key-authorization, or tx failure)
ChallengeValidateValid atomic.Uint64 // validator returned nil
ChallengeValidateInvalid atomic.Uint64 // validator returned error
}
// NewACMEMetrics returns a zeroed counter table. Concurrent callers
@@ -328,24 +372,28 @@ func (m *ACMEMetrics) bump(c *atomic.Uint64) { c.Add(1) }
// directly without per-op stringly-typed branching.
func (m *ACMEMetrics) Snapshot() map[string]uint64 {
return map[string]uint64{
"certctl_acme_directory_total": m.DirectoryTotal.Load(),
"certctl_acme_directory_failures_total": m.DirectoryFailureTotal.Load(),
"certctl_acme_new_nonce_total": m.NewNonceTotal.Load(),
"certctl_acme_new_nonce_failures_total": m.NewNonceFailureTotal.Load(),
"certctl_acme_new_account_total": m.NewAccountTotal.Load(),
"certctl_acme_new_account_failures_total": m.NewAccountFailureTotal.Load(),
"certctl_acme_new_account_idempotent_total": m.NewAccountIdempotentTotal.Load(),
"certctl_acme_update_account_total": m.UpdateAccountTotal.Load(),
"certctl_acme_update_account_failures_total": m.UpdateAccountFailureTotal.Load(),
"certctl_acme_deactivate_account_total": m.DeactivateAccountTotal.Load(),
"certctl_acme_new_order_total": m.NewOrderTotal.Load(),
"certctl_acme_new_order_failures_total": m.NewOrderFailureTotal.Load(),
"certctl_acme_new_order_rejected_total": m.NewOrderRejectedTotal.Load(),
"certctl_acme_finalize_order_total": m.FinalizeOrderTotal.Load(),
"certctl_acme_finalize_order_failures_total": m.FinalizeOrderFailureTotal.Load(),
"certctl_acme_cert_download_total": m.CertDownloadTotal.Load(),
"certctl_acme_cert_download_failures_total": m.CertDownloadFailureTotal.Load(),
"certctl_acme_authz_read_total": m.AuthzReadTotal.Load(),
"certctl_acme_challenge_respond_total": m.ChallengeRespondTotal.Load(),
"certctl_acme_challenge_respond_failures_total": m.ChallengeRespondFailTotal.Load(),
"certctl_acme_challenge_validate_valid_total": m.ChallengeValidateValid.Load(),
"certctl_acme_challenge_validate_invalid_total": m.ChallengeValidateInvalid.Load(),
}
}
@@ -1084,3 +1132,228 @@ func identifierStrings(ids []domain.ACMEIdentifier) []string {
}
return out
}
// --- Phase 3 — challenge dispatch + validator callback -----------------
// ChallengeResponseShape wraps the post-dispatch challenge row
// (status=processing) so the handler can render it via an
// acme.MarshalAuthorization-equivalent. RespondToChallenge itself
// returns the *domain.ACMEChallenge directly. The validator goroutine
// writes the final status (valid/invalid) in a callback once
// validation completes; clients fetching the challenge via authz GET
// see the eventual state.
type ChallengeResponseShape struct {
Challenge *domain.ACMEChallenge
}
// RespondToChallenge handles POST /acme/profile/<id>/challenge/<chall_id>
// per RFC 8555 §7.5.1.
//
// Behavior:
// - Look up the challenge + parent authz + parent order; assert the
// account owns the order.
// - If the challenge is already valid/invalid → idempotent return.
// - If pending: transition to processing (atomic via WithinTx + audit).
// - Submit to the validator pool with an onComplete callback that
// transitions the challenge to valid/invalid in another WithinTx
// (and cascades the parent authz status).
// - Return the challenge in its current (processing) state; the
// client polls authz/challenge for the eventual outcome.
func (s *ACMEService) RespondToChallenge(
ctx context.Context,
accountID, challengeID string,
accountJWK *jose.JSONWebKey,
) (*domain.ACMEChallenge, error) {
if s.tx == nil || s.auditService == nil {
s.metrics.bump(&s.metrics.ChallengeRespondFailTotal)
return nil, fmt.Errorf("acme: respond-to-challenge requires SetTransactor + SetAuditService")
}
if s.validatorPool == nil {
s.metrics.bump(&s.metrics.ChallengeRespondFailTotal)
return nil, ErrACMEChallengePoolUnconfigured
}
ch, err := s.repo.GetChallengeByID(ctx, challengeID)
if err != nil {
s.metrics.bump(&s.metrics.ChallengeRespondFailTotal)
if errors.Is(err, repository.ErrNotFound) {
return nil, ErrACMEChallengeNotFound
}
return nil, fmt.Errorf("acme: lookup challenge: %w", err)
}
// Confirm the requesting account owns the parent authz/order. This
// runs before the idempotent returns below so that another account
// cannot read this challenge's state by POSTing to its URL.
authz, err := s.repo.GetAuthzByID(ctx, ch.AuthzID)
if err != nil {
s.metrics.bump(&s.metrics.ChallengeRespondFailTotal)
return nil, fmt.Errorf("acme: lookup parent authz: %w", err)
}
order, err := s.repo.GetOrderByID(ctx, authz.OrderID)
if err != nil {
s.metrics.bump(&s.metrics.ChallengeRespondFailTotal)
return nil, fmt.Errorf("acme: lookup parent order: %w", err)
}
if order.AccountID != accountID {
s.metrics.bump(&s.metrics.ChallengeRespondFailTotal)
return nil, ErrACMEOrderUnauthorized
}
// Idempotent re-POST: already valid/invalid, or still in flight
// (processing) → return the row as-is.
switch ch.Status {
case domain.ACMEChallengeStatusValid,
domain.ACMEChallengeStatusInvalid,
domain.ACMEChallengeStatusProcessing:
s.metrics.bump(&s.metrics.ChallengeRespondTotal)
return ch, nil
}
// Compute the key authorization the validator needs.
expected, err := acme.KeyAuthorization(ch.Token, accountJWK)
if err != nil {
s.metrics.bump(&s.metrics.ChallengeRespondFailTotal)
return nil, fmt.Errorf("acme: key authorization: %w", err)
}
// Transition challenge → processing (atomic with audit row).
ch.Status = domain.ACMEChallengeStatusProcessing
if err := s.tx.WithinTx(ctx, func(q repository.Querier) error {
if err := s.repo.UpdateChallengeWithTx(ctx, q, ch); err != nil {
return err
}
return s.auditService.RecordEventWithTx(ctx, q,
fmt.Sprintf("acme:%s", accountID), domain.ActorTypeUser,
"acme_challenge_processing", "acme_challenge", ch.ChallengeID,
map[string]interface{}{
"authz_id": ch.AuthzID,
"type": string(ch.Type),
"identifier": authz.Identifier.Value,
})
}); err != nil {
s.metrics.bump(&s.metrics.ChallengeRespondFailTotal)
return nil, err
}
// Submit to the pool. The onComplete callback persists the final
// challenge status + cascades the parent authz status. We use a
// fresh background context here so the callback's WithinTx isn't
// canceled when the originating HTTP request returns.
bgctx := context.Background()
chSnapshot := *ch
authzSnapshot := *authz
identifier := authz.Identifier.Value
s.validatorPool.Submit(bgctx, string(ch.Type), identifier, ch.Token, expected, func(verr error) {
s.recordChallengeOutcome(bgctx, accountID, &chSnapshot, &authzSnapshot, verr)
})
s.metrics.bump(&s.metrics.ChallengeRespondTotal)
return ch, nil
}
// recordChallengeOutcome is the validator-pool callback. Persists the
// challenge's final status + cascades the parent authz status.
//
// Authz cascade: if the challenge succeeded, the authz becomes valid
// (RFC 8555 §7.1.6: any one challenge passing makes the authz valid).
// If the challenge failed, the authz is marked invalid on the first
// failure. Phase 3 emits 1 challenge per authz, so no other pending
// challenge can remain; Phase 4+ revisits this when extending to
// multiple challenges per authz.
func (s *ACMEService) recordChallengeOutcome(
ctx context.Context,
accountID string,
ch *domain.ACMEChallenge,
authz *domain.ACMEAuthorization,
verr error,
) {
now := time.Now().UTC()
var newAuthzStatus domain.ACMEAuthzStatus
if verr == nil {
ch.Status = domain.ACMEChallengeStatusValid
ch.ValidatedAt = &now
ch.Error = nil
newAuthzStatus = domain.ACMEAuthzStatusValid
s.metrics.bump(&s.metrics.ChallengeValidateValid)
} else {
ch.Status = domain.ACMEChallengeStatusInvalid
if p := acme.ChallengeProblemFromError(string(ch.Type), verr); p != nil {
ch.Error = &domain.ACMEProblem{
Type: p.Type,
Detail: p.Detail,
Status: p.Status,
}
}
newAuthzStatus = domain.ACMEAuthzStatusInvalid
s.metrics.bump(&s.metrics.ChallengeValidateInvalid)
}
auditDetails := map[string]interface{}{
"authz_id": ch.AuthzID,
"type": string(ch.Type),
"identifier": authz.Identifier.Value,
"valid": verr == nil,
}
if verr != nil {
auditDetails["error"] = verr.Error()
}
_ = s.tx.WithinTx(ctx, func(q repository.Querier) error {
if err := s.repo.UpdateChallengeWithTx(ctx, q, ch); err != nil {
return err
}
if err := s.repo.UpdateAuthzStatusWithTx(ctx, q, ch.AuthzID, newAuthzStatus); err != nil {
return err
}
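// Cascade: recompute the order's status from its authzs. Flip the
// order to ready if ALL are valid, or to invalid if ANY is invalid.
// ListAuthzsByOrder does not take the tx querier q, so it may not
// see the status written above; the loop therefore substitutes
// newAuthzStatus for this authz's row.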
authzs, err := s.repo.ListAuthzsByOrder(ctx, authz.OrderID)
if err != nil {
return err
}
allValid := len(authzs) > 0
anyInvalid := false
for _, a := range authzs {
if a.AuthzID == ch.AuthzID {
if newAuthzStatus != domain.ACMEAuthzStatusValid {
allValid = false
}
if newAuthzStatus == domain.ACMEAuthzStatusInvalid {
anyInvalid = true
}
continue
}
if a.Status != domain.ACMEAuthzStatusValid {
allValid = false
}
if a.Status == domain.ACMEAuthzStatusInvalid {
anyInvalid = true
}
}
order, err := s.repo.GetOrderByID(ctx, authz.OrderID)
if err != nil {
return err
}
switch {
case allValid && order.Status == domain.ACMEOrderStatusPending:
order.Status = domain.ACMEOrderStatusReady
if err := s.repo.UpdateOrderWithTx(ctx, q, order); err != nil {
return err
}
case anyInvalid && order.Status == domain.ACMEOrderStatusPending:
order.Status = domain.ACMEOrderStatusInvalid
order.Error = &domain.ACMEProblem{
Type: "urn:ietf:params:acme:error:incorrectResponse",
Detail: "one or more authorizations failed",
Status: 403,
}
if err := s.repo.UpdateOrderWithTx(ctx, q, order); err != nil {
return err
}
}
return s.auditService.RecordEventWithTx(ctx, q,
fmt.Sprintf("acme:%s", accountID), domain.ActorTypeUser,
"acme_challenge_completed", "acme_challenge", ch.ChallengeID,
auditDetails)
})
}
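
Review note: the order roll-up inside recordChallengeOutcome reduces to a
small pure function, which may be easier to reason about and unit-test in
isolation. The sketch below is an illustration under assumptions, not
code from this commit: the status type, its constants and
nextOrderStatus are hypothetical stand-ins for the domain types.

```go
package main

import "fmt"

// status is a hypothetical stand-in for the domain status types.
type status string

const (
	statusPending status = "pending"
	statusValid   status = "valid"
	statusInvalid status = "invalid"
	statusReady   status = "ready"
)

// nextOrderStatus applies the roll-up rule: a pending order becomes
// ready when ALL authzs are valid and invalid when ANY authz is
// invalid. Orders in any other state are left untouched, as is an
// order with no authzs.
func nextOrderStatus(current status, authzs []status) status {
	if current != statusPending || len(authzs) == 0 {
		return current
	}
	allValid := true
	for _, a := range authzs {
		if a == statusInvalid {
			return statusInvalid
		}
		if a != statusValid {
			allValid = false
		}
	}
	if allValid {
		return statusReady
	}
	return current
}

func main() {
	fmt.Println(nextOrderStatus(statusPending, []status{statusValid, statusValid}))
	fmt.Println(nextOrderStatus(statusPending, []status{statusValid, statusPending}))
	fmt.Println(nextOrderStatus(statusPending, []status{statusValid, statusInvalid}))
}
```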
+9
@@ -137,6 +137,15 @@ func (f *fakeACMERepo) ListAuthzsByOrder(ctx context.Context, orderID string) ([
func (f *fakeACMERepo) CreateChallengeWithTx(ctx context.Context, q repository.Querier, ch *domain.ACMEChallenge) error {
return nil
}
func (f *fakeACMERepo) GetChallengeByID(ctx context.Context, challengeID string) (*domain.ACMEChallenge, error) {
return nil, repository.ErrNotFound
}
func (f *fakeACMERepo) UpdateChallengeWithTx(ctx context.Context, q repository.Querier, ch *domain.ACMEChallenge) error {
return nil
}
func (f *fakeACMERepo) UpdateAuthzStatusWithTx(ctx context.Context, q repository.Querier, authzID string, status domain.ACMEAuthzStatus) error {
return nil
}
// fakeTransactor is the repository.Transactor stand-in: runs fn
// against the supplied querier (we just pass nil — fakes ignore it).