acme-server: HTTP-01 + DNS-01 + TLS-ALPN-01 challenge validation (Phase 3/7)

Wires up the actual challenge-validation machinery so profiles in
acme_auth_mode='challenge' resolve end-to-end. After this commit,
cert-manager 1.15+ with `solver: http01: ingress` against a
challenge-mode profile completes a real HTTP-01 flow and gets a cert.
DNS-01 and TLS-ALPN-01 go through the same code path; only the
selected validator differs.

Architecture (the load-bearing parts):
  - 3 separate semaphore-bounded worker pools (one per challenge type),
    so HTTP-01 and DNS-01 can't starve each other under load. Default
    weight 10 per type; tunable via CERTCTL_ACME_SERVER_HTTP01_CONCURRENCY,
    DNS01_CONCURRENCY, TLSALPN01_CONCURRENCY.
  - 30s per-challenge timeout (configurable via PoolConfig.PerChallengeTimeout).
  - HTTP-01 validator runs validation.IsReservedIPForDial on the
    resolved IP, both at the initial dial and on every redirect hop,
    so SSRF probes into private IP space are refused before the
    connect. IsReservedIPForDial is a newly exported wrapper; the
    existing private impl is preserved byte-for-byte for the network
    scanner + ValidateSafeURL paths.
  - DNS-01 validator uses a dedicated resolver pointed at
    CERTCTL_ACME_SERVER_DNS01_RESOLVER (default 8.8.8.8:53). It does
    NOT use the system resolver, which keeps behavior deterministic
    across deployments. Wildcard handling: a `*.example.com`
    identifier is validated via the TXT record at
    _acme-challenge.example.com.
  - TLS-ALPN-01 validator (RFC 8737) connects with ALPN `acme-tls/1`,
    inspects the id-pe-acmeIdentifier extension (OID 1.3.6.1.5.5.7.1.31),
    asserts the ASN.1 OCTET STRING value equals SHA-256 of the key
    authorization. Cert chain is intentionally NOT validated
    (InsecureSkipVerify=true is correct per RFC 8737: the proof is
    in the extension, not the chain). The justification is recorded
    in the docs/tls.md L-001 table and in the //nolint:gosec comment.
    SSRF guard: same posture as HTTP-01.
  - Validation is asynchronous: handler accepts the POST and returns
    200 immediately with status=processing; the worker-pool fires a
    callback that updates challenge → authz → order in a fresh
    background-context WithinTx. The order auto-promotes to `ready`
    when ALL authzs become valid; auto-fails to `invalid` when ANY
    authz becomes invalid.

What ships:
  - internal/api/acme/challenge.go: KeyAuthorization (RFC 8555 §8.1) +
    DNS01TXTRecordValue (§8.4) + TLSALPN01ExtensionValue (RFC 8737 §3)
    helpers; IDPEAcmeIdentifierOID; ChallengeProblemFromError mapper
    (4-way: connection / dns / tls / incorrectResponse); 9 sentinel
    errors covering every named failure mode.
  - internal/api/acme/validators.go: ChallengeValidator interface;
    Pool dispatcher with 3 semaphores + per-type in-flight + peak
    gauges; HTTP01Validator + DNS01Validator + TLSALPN01Validator
    implementations; Drain method called from cmd/server/main.go's
    shutdown sequence.
  - internal/api/acme/validators_test.go: KeyAuthorization round-trip,
    DNS01 / TLS-ALPN-01 helper tests, SSRF rejection, bounded-
    concurrency saturation test (peak-in-flight ≤ cap), type-isolation
    test (HTTP-01 saturation doesn't block DNS-01), UnknownType test,
    7-case ChallengeProblemFromError mapping.
  - internal/repository/postgres/acme.go: GetChallengeByID +
    UpdateChallengeWithTx + UpdateAuthzStatusWithTx.
  - internal/service/acme.go: SetValidatorPool wires the *acme.Pool;
    RespondToChallenge dispatches with account-ownership assertion +
    KeyAuthorization computation + processing-status transition (atomic
    + audit); recordChallengeOutcome callback persists the final
    challenge + cascading authz + order-promote/-fail in one WithinTx +
    audit row. 4 new metrics.
  - internal/api/handler/acme.go: Challenge handler; round-trips
    account.JWKPEM through ParseJWKFromPEM to recover the *jose.JSONWebKey
    the validator pool needs.
  - internal/api/router/router.go + openapi_parity_test.go +
    api/openapi-handler-exceptions.yaml: 2 new routes (per-profile +
    shorthand for challenge/{chall_id}) with parity exceptions.
  - cmd/server/main.go: constructs the Pool at startup with the
    per-type concurrency caps from cfg.ACMEServer; ACMEService.ValidatorPool()
    accessor exposed for the shutdown drain sequence.
  - internal/validation/ssrf.go: exported IsReservedIPForDial wrapper
    (private impl unchanged; network scanner + ValidateSafeURL paths
    byte-identical with prior behavior).
  - docs/tls.md: L-001 InsecureSkipVerify table extended with the
    TLS-ALPN-01 validator justification (RFC 8737 §3).
  - docs/acme-server.md: phase status updated; endpoints table grows
    the challenge row; phases-cross-reference flips Phase 3 → live.
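
For orientation, the values the challenge.go helpers compute can be sketched
with the standard library alone. This is an illustration of RFC 8555 §8.1 /
§8.4 and RFC 8737 §3, not the code in this commit: the lowercase function
names and their signatures are assumptions, and the account-key thumbprint is
taken as a precomputed string rather than derived from a *jose.JSONWebKey.

```go
package main

import (
	"crypto/sha256"
	"encoding/asn1"
	"encoding/base64"
	"fmt"
	"strings"
)

// keyAuthorization joins the challenge token and the base64url-encoded JWK
// thumbprint of the account key with a "." (RFC 8555 §8.1).
func keyAuthorization(token, thumbprint string) string {
	return token + "." + thumbprint
}

// dns01TXTValue is the unpadded base64url encoding of the SHA-256 digest of
// the key authorization (RFC 8555 §8.4).
func dns01TXTValue(keyAuth string) string {
	sum := sha256.Sum256([]byte(keyAuth))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

// dns01QueryName strips a leading wildcard label and prefixes the
// _acme-challenge label.
func dns01QueryName(identifier string) string {
	return "_acme-challenge." + strings.TrimPrefix(identifier, "*.")
}

// tlsALPN01ExtensionValue returns the DER encoding of an ASN.1 OCTET STRING
// holding the SHA-256 digest of the key authorization (RFC 8737 §3).
func tlsALPN01ExtensionValue(keyAuth string) ([]byte, error) {
	sum := sha256.Sum256([]byte(keyAuth))
	return asn1.Marshal(sum[:])
}

func main() {
	// Placeholder inputs; a real thumbprint comes from the account JWK.
	ka := keyAuthorization("example-token", "example-thumbprint")
	fmt.Println("key authorization:", ka)
	fmt.Println("dns-01 TXT value: ", dns01TXTValue(ka))
	fmt.Println("dns-01 query name:", dns01QueryName("*.example.com"))
	ext, err := tlsALPN01ExtensionValue(ka)
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Printf("tls-alpn-01 extension value: %d bytes\n", len(ext))
}
```

Whether the shipped TLSALPN01ExtensionValue returns the raw digest or the
DER-wrapped form is not stated here; the sketch returns the wrapped form.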

Tests:
  - 80%+ coverage on the new files.
  - BoundedConcurrency test: 10 challenges submitted against an
    HTTP-01 pool of weight 3; observed peak-in-flight ≤ 3, all 10
    eventually complete, post-Drain in-flight returns to 0.
  - TypeIsolation test: HTTP-01 saturation does NOT block a DNS-01
    submission; DNS-01 callback fires within 2s.
  - SSRF rejection test: a Validate against `localhost` is refused
    before the dial (ErrChallengeReservedIP or ErrChallengeConnection).

Engineering history: cowork/WORKSPACE-CHANGELOG.md "ACME-Server-3".
shankar0123
2026-05-03 14:09:00 +00:00
parent 4acd19910d
commit 7e22204ba7
15 changed files with 1407 additions and 32 deletions
+93 -1
@@ -49,6 +49,8 @@ type ACMEService interface {
ListAuthzsByOrder(ctx context.Context, orderID string) ([]*domain.ACMEAuthorization, error)
FinalizeOrder(ctx context.Context, accountID, orderID, profileID string, csr *x509.CertificateRequest, csrPEM string) (*service.FinalizeOrderResult, error)
LookupCertificate(ctx context.Context, certID, accountID string) (string, error)
// Phase 3 — challenge validation.
RespondToChallenge(ctx context.Context, accountID, challengeID string, accountJWK *jose.JSONWebKey) (*domain.ACMEChallenge, error)
}
// ACMEHandler exposes the ACME server's RFC 8555 endpoints under the
@@ -211,8 +213,20 @@ func writeServiceError(w http.ResponseWriter, err error) {
Detail: "order is not in the `ready` state; complete authorizations first",
Status: http.StatusForbidden,
})
- case errors.Is(err, service.ErrACMEUnsupportedAuthMode), errors.Is(err, service.ErrACMEFinalizeUnconfigured):
+ case errors.Is(err, service.ErrACMEUnsupportedAuthMode), errors.Is(err, service.ErrACMEFinalizeUnconfigured), errors.Is(err, service.ErrACMEChallengePoolUnconfigured):
acme.WriteProblem(w, acme.ServerInternal("ACME server is not fully configured; contact the operator"))
case errors.Is(err, service.ErrACMEChallengeNotFound):
acme.WriteProblem(w, acme.Problem{
Type: "urn:ietf:params:acme:error:malformed",
Detail: "challenge not found",
Status: http.StatusNotFound,
})
case errors.Is(err, service.ErrACMEChallengeWrongState):
acme.WriteProblem(w, acme.Problem{
Type: "urn:ietf:params:acme:error:malformed",
Detail: "challenge is no longer in pending state",
Status: http.StatusBadRequest,
})
default:
// Avoid leaking internal error text per master-prompt
// criterion #10 (operator-actionable errors with no info
@@ -793,3 +807,81 @@ func parseOptionalTime(s string) *time.Time {
}
return &t
}
// Challenge handles POST /acme/profile/{id}/challenge/{chall_id}
// (RFC 8555 §7.5.1). The client posts an empty body (modern ACME) or
// a `{}` payload to indicate "I'm ready for you to validate this
// challenge." The handler dispatches the validator-pool work + returns
// the challenge in its current (processing) state. Clients poll authz
// or challenge for the eventual outcome.
//
// Phase 3: account JWK is needed to compute the key authorization. The
// JWS verifier returns the registered account's stored JWKPEM in the
// VerifiedRequest.Account; we round-trip that PEM through ParseJWKFromPEM
// to get the *jose.JSONWebKey the validator pool needs.
func (h ACMEHandler) Challenge(w http.ResponseWriter, r *http.Request) {
	profileID := r.PathValue("id")
	challengeID := r.PathValue("chall_id")
	requestURL := h.requestURL(r)
	body, err := io.ReadAll(io.LimitReader(r.Body, MaxJWSBodyBytes+1))
	if err != nil {
		acme.WriteProblem(w, acme.Malformed("could not read request body"))
		return
	}
	if len(body) > MaxJWSBodyBytes {
		acme.WriteProblem(w, acme.Malformed("request body too large"))
		return
	}
	verified, err := h.svc.VerifyJWS(r.Context(), body, requestURL, false /*expectNewAccount*/, h.accountKID(r, profileID))
	if err != nil {
		acme.WriteProblem(w, acme.MapJWSErrorToProblem(err))
		return
	}
	if verified.Account == nil {
		acme.WriteProblem(w, acme.MapJWSErrorToProblem(acme.ErrJWSAccountNotFound))
		return
	}
	// Reconstruct the account's public JWK from its stored PEM. This
	// is what the validator pool needs to compute key authorizations.
	jwk, err := acme.ParseJWKFromPEM(verified.Account.JWKPEM)
	if err != nil {
		acme.WriteProblem(w, acme.ServerInternal("could not parse stored account JWK"))
		return
	}
	ch, err := h.svc.RespondToChallenge(r.Context(), verified.Account.AccountID, challengeID, jwk)
	if err != nil {
		writeServiceError(w, err)
		return
	}
	if nonce, err := h.svc.IssueNonce(r.Context()); err == nil {
		w.Header().Set("Replay-Nonce", nonce)
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	_ = json.NewEncoder(w).Encode(marshalChallengeResponse(ch, h.challengeURLBuilder(r, profileID)))
}
// marshalChallengeResponse renders a single ACMEChallenge in the
// RFC 8555 §8 wire shape. Distinct from MarshalAuthorization (which
// embeds challenges in an authz wrapper); the challenge endpoint
// returns one challenge directly per RFC 8555 §7.5.1.
func marshalChallengeResponse(ch *domain.ACMEChallenge, urlBuilder func(string) string) acme.ChallengeResponseJSON {
	out := acme.ChallengeResponseJSON{
		Type:   string(ch.Type),
		URL:    urlBuilder(ch.ChallengeID),
		Status: string(ch.Status),
		Token:  ch.Token,
	}
	if ch.ValidatedAt != nil {
		out.Validated = ch.ValidatedAt.UTC().Format(time.RFC3339)
	}
	if ch.Error != nil {
		out.Error = &acme.Problem{Type: ch.Error.Type, Detail: ch.Error.Detail, Status: ch.Error.Status}
	}
	return out
}
+10
@@ -40,6 +40,8 @@ type mockACMEService struct {
ListAuthzsByOrderFn func(ctx context.Context, orderID string) ([]*domain.ACMEAuthorization, error)
FinalizeOrderFn func(ctx context.Context, accountID, orderID, profileID string, csr *x509.CertificateRequest, csrPEM string) (*service.FinalizeOrderResult, error)
LookupCertificateFn func(ctx context.Context, certID, accountID string) (string, error)
// Phase 3.
RespondToChallengeFn func(ctx context.Context, accountID, challengeID string, accountJWK *jose.JSONWebKey) (*domain.ACMEChallenge, error)
}
func (m *mockACMEService) BuildDirectory(ctx context.Context, profileID, baseURL string) (*acme.Directory, error) {
@@ -133,6 +135,13 @@ func (m *mockACMEService) LookupCertificate(ctx context.Context, certID, account
return "", errors.New("LookupCertificate not stubbed")
}
func (m *mockACMEService) RespondToChallenge(ctx context.Context, accountID, challengeID string, accountJWK *jose.JSONWebKey) (*domain.ACMEChallenge, error) {
	if m.RespondToChallengeFn != nil {
		return m.RespondToChallengeFn(ctx, accountID, challengeID, accountJWK)
	}
	return nil, errors.New("RespondToChallenge not stubbed")
}
// newACMETestServer wires the ACMEHandler against the mock + a stdlib
// ServeMux configured exactly the way internal/api/router/router.go
// does it in production. Routes:
@@ -156,6 +165,7 @@ func newACMETestServer(t *testing.T, mock *mockACMEService) *httptest.Server {
mux.HandleFunc("POST /acme/profile/{id}/order/{ord_id}", h.Order)
mux.HandleFunc("POST /acme/profile/{id}/order/{ord_id}/finalize", h.OrderFinalize)
mux.HandleFunc("POST /acme/profile/{id}/authz/{authz_id}", h.Authz)
mux.HandleFunc("POST /acme/profile/{id}/challenge/{chall_id}", h.Challenge)
mux.HandleFunc("POST /acme/profile/{id}/cert/{cert_id}", h.Cert)
mux.HandleFunc("GET /acme/directory", h.Directory)
mux.HandleFunc("HEAD /acme/new-nonce", h.NewNonce)