acme-server: HTTP-01 + DNS-01 + TLS-ALPN-01 challenge validation (Phase 3/7)

Wires up the actual challenge-validation machinery so profiles in
acme_auth_mode='challenge' resolve end-to-end. After this commit,
cert-manager 1.15+ with `solver: http01: ingress` against a
challenge-mode profile completes a real HTTP-01 flow and gets a cert.
DNS-01 and TLS-ALPN-01 share the same code path; only the validator
selected differs.

Architecture (the load-bearing parts):
  - 3 separate semaphore-bounded worker pools (one per challenge type),
    so HTTP-01 and DNS-01 can't starve each other under load. Default
    weight 10 per type; tunable via CERTCTL_ACME_SERVER_HTTP01_CONCURRENCY,
    DNS01_CONCURRENCY, TLSALPN01_CONCURRENCY.
  - 30s per-challenge timeout (configurable via PoolConfig.PerChallengeTimeout).
  - HTTP-01 validator runs validation.IsReservedIPForDial on the
    resolved IP, both at the initial dial and at every redirect hop,
    so SSRF probes into private IP space are refused before the
    connect. IsReservedIPForDial is a newly exported wrapper; the
    existing private impl is preserved byte-for-byte for the network
    scanner + ValidateSafeURL paths.
  - DNS-01 validator uses a dedicated resolver pointed at
    CERTCTL_ACME_SERVER_DNS01_RESOLVER (default 8.8.8.8:53). It
    deliberately bypasses the system resolver so that behavior stays
    deterministic across deployments. Wildcard handling: an identifier
    of `*.example.com` queries the TXT record at
    _acme-challenge.example.com.
  - TLS-ALPN-01 validator (RFC 8737) connects with ALPN `acme-tls/1`,
    inspects the id-pe-acmeIdentifier extension (OID 1.3.6.1.5.5.7.1.31),
    and asserts that its ASN.1 OCTET STRING value equals the SHA-256
    digest of the key authorization. The cert chain is intentionally
    NOT validated: InsecureSkipVerify=true is correct per RFC 8737
    because the proof is in the extension, not the chain. The
    justification is recorded in the docs/tls.md L-001 table and in
    the //nolint:gosec comment. SSRF guard: same posture as HTTP-01.
  - Validation is asynchronous: handler accepts the POST and returns
    200 immediately with status=processing; the worker-pool fires a
    callback that updates challenge → authz → order in a fresh
    background-context WithinTx. The order auto-promotes to `ready`
    when ALL authzs become valid; auto-fails to `invalid` when ANY
    authz becomes invalid.

What ships:
  - internal/api/acme/challenge.go: KeyAuthorization (RFC 8555 §8.1) +
    DNS01TXTRecordValue (§8.4) + TLSALPN01ExtensionValue (RFC 8737 §3)
    helpers; IDPEAcmeIdentifierOID; ChallengeProblemFromError mapper
    (4-way: connection / dns / tls / incorrectResponse); 9 sentinel
    errors covering every named failure mode.
  - internal/api/acme/validators.go: ChallengeValidator interface;
    Pool dispatcher with 3 semaphores + per-type in-flight + peak
    gauges; HTTP01Validator + DNS01Validator + TLSALPN01Validator
    implementations; Drain method called from cmd/server/main.go's
    shutdown sequence.
  - internal/api/acme/validators_test.go: KeyAuthorization round-trip,
    DNS01 / TLS-ALPN-01 helper tests, SSRF rejection, bounded-
    concurrency saturation test (peak-in-flight ≤ cap), type-isolation
    test (HTTP-01 saturation doesn't block DNS-01), UnknownType test,
    7-case ChallengeProblemFromError mapping.
  - internal/repository/postgres/acme.go: GetChallengeByID +
    UpdateChallengeWithTx + UpdateAuthzStatusWithTx.
  - internal/service/acme.go: SetValidatorPool wires the *acme.Pool;
    RespondToChallenge dispatches with account-ownership assertion +
    KeyAuthorization computation + processing-status transition (atomic
    + audit); recordChallengeOutcome callback persists the final
    challenge + cascading authz + order-promote/-fail in one WithinTx +
    audit row. 4 new metrics.
  - internal/api/handler/acme.go: Challenge handler; round-trips
    account.JWKPEM through ParseJWKFromPEM to recover the *jose.JSONWebKey
    the validator pool needs.
  - internal/api/router/router.go + openapi_parity_test.go +
    api/openapi-handler-exceptions.yaml: 2 new routes (per-profile +
    shorthand for challenge/{chall_id}) with parity exceptions.
  - cmd/server/main.go: constructs the Pool at startup with the
    per-type concurrency caps from cfg.ACMEServer; ACMEService.ValidatorPool()
    accessor exposed for the shutdown drain sequence.
  - internal/validation/ssrf.go: exported IsReservedIPForDial wrapper
    (private impl unchanged; network scanner + ValidateSafeURL paths
    byte-identical with prior behavior).
  - docs/tls.md: L-001 InsecureSkipVerify table extended with the
    TLS-ALPN-01 validator justification (RFC 8737 §3).
  - docs/acme-server.md: phase status updated; endpoints table grows
    the challenge row; phases-cross-reference flips Phase 3 → live.

Tests:
  - 80%+ coverage on the new files.
  - BoundedConcurrency test: 10 challenges submitted against an
    HTTP-01 pool of weight 3; observed peak-in-flight ≤ 3, all 10
    eventually complete, post-Drain in-flight returns to 0.
  - TypeIsolation test: HTTP-01 saturation does NOT block a DNS-01
    submission; DNS-01 callback fires within 2s.
  - SSRF rejection test: a Validate against `localhost` is refused
    before the dial (ErrChallengeReservedIP or ErrChallengeConnection).

Engineering history: cowork/WORKSPACE-CHANGELOG.md "ACME-Server-3".
shankar0123
2026-05-03 14:09:00 +00:00
parent 45fae9952a
commit 9bc845304e
15 changed files with 1407 additions and 32 deletions
@@ -471,6 +471,73 @@ func (r *ACMERepository) ListChallengesByAuthz(ctx context.Context, authzID stri
	return out, rows.Err()
}

// GetChallengeByID retrieves a challenge row by ID.
func (r *ACMERepository) GetChallengeByID(ctx context.Context, challengeID string) (*domain.ACMEChallenge, error) {
	row := r.db.QueryRowContext(ctx, `
		SELECT challenge_id, authz_id, type, status, token, validated_at, error, created_at
		FROM acme_challenges WHERE challenge_id = $1
	`, challengeID)
	return scanACMEChallenge(row)
}

// UpdateChallengeWithTx persists changes to a challenge's mutable
// fields (status, validated_at, error). Used by the Phase 3 validator
// callback after a challenge attempt completes.
func (r *ACMERepository) UpdateChallengeWithTx(ctx context.Context, q repository.Querier, ch *domain.ACMEChallenge) error {
	var (
		validatedAt interface{}
		errBlob     interface{}
	)
	if ch.ValidatedAt != nil {
		validatedAt = *ch.ValidatedAt
	}
	if ch.Error != nil {
		b, err := jsonMarshalACME(ch.Error)
		if err != nil {
			return fmt.Errorf("acme: marshal challenge error: %w", err)
		}
		errBlob = b
	}
	res, err := q.ExecContext(ctx, `
		UPDATE acme_challenges
		SET status = $2, validated_at = $3, error = $4
		WHERE challenge_id = $1
	`, ch.ChallengeID, string(ch.Status), validatedAt, errBlob)
	if err != nil {
		return fmt.Errorf("acme: update challenge: %w", err)
	}
	n, err := res.RowsAffected()
	if err != nil {
		return fmt.Errorf("acme: update challenge rows: %w", err)
	}
	if n == 0 {
		return fmt.Errorf("challenge not found: %w", repository.ErrNotFound)
	}
	return nil
}

// UpdateAuthzStatusWithTx persists an authz status transition. Used
// by the Phase 3 validator callback to flip an authz to valid or
// invalid based on the challenge outcome.
func (r *ACMERepository) UpdateAuthzStatusWithTx(ctx context.Context, q repository.Querier, authzID string, status domain.ACMEAuthzStatus) error {
	res, err := q.ExecContext(ctx, `
		UPDATE acme_authorizations
		SET status = $2, updated_at = NOW()
		WHERE authz_id = $1
	`, authzID, string(status))
	if err != nil {
		return fmt.Errorf("acme: update authz status: %w", err)
	}
	n, err := res.RowsAffected()
	if err != nil {
		return fmt.Errorf("acme: update authz status rows: %w", err)
	}
	if n == 0 {
		return fmt.Errorf("authz not found: %w", repository.ErrNotFound)
	}
	return nil
}

// scanACMEOrder parses an acme_orders row.
func scanACMEOrder(row interface{ Scan(...interface{}) error }) (*domain.ACMEOrder, error) {
	var (