acme-server: key rollover + revocation + ARI (Phase 4/7)

Closes the RFC 8555 + RFC 9773 surface beyond the issuance happy-path:
  - POST /acme/profile/<id>/key-change   (RFC 8555 §7.3.5)
  - POST /acme/profile/<id>/revoke-cert  (RFC 8555 §7.6)
  - GET  /acme/profile/<id>/renewal-info/<cert-id>  (RFC 9773 ARI)

After this commit, ACME clients can rotate account keys, revoke certs
through the ACME surface (rather than only via the certctl GUI/API),
and fetch ARI for proactive renewal scheduling.

Architecture:
  - Key rollover: outer JWS verified against the registered account key
    (existing kid path); the inner JWS — embedded as the outer's payload
    — verified against the embedded NEW jwk in a new dedicated routine
    (ParseAndVerifyKeyChangeInner) that enforces RFC 8555 §7.3.5
    inner-only invariants: MUST use jwk + MUST NOT use kid, payload
    .account == outer.kid, payload.oldKey thumbprint-equals registered.
    A single WithinTx swaps the stored thumbprint+pem and writes the
    audit row. Concurrent-rollover safety via SELECT…FOR UPDATE on the
    conflicting account row in UpdateAccountJWKWithTx; the loser
    observes the winner's new thumbprint and is told to retry (409).
  - Revocation: two auth paths. kid → AccountOwnsCertificate single-
    indexed COUNT lookup over acme_orders. jwk → constant-time RFC 7638
    thumbprint compare against the cert's pubkey. Both paths route
    through service.RevocationSvc.RevokeCertificateWithActor so the
    existing CRL/OCSP refresh + audit + metrics pipeline applies. RFC
    5280 §5.3.1 numeric reason codes clamp to certctl's
    domain.ValidRevocationReasons; codes 8 (removeFromCRL) + 10
    (aACompromise) clamp to 'unspecified' since they aren't in the set.
  - ARI is GET-only and unauth per RFC 9773 §4. Cert-id wire shape is
    base64url(AKI).base64url(serial); ParseARICertID strict-decodes,
    SerialHex emits the canonical certctl-shape lowercase-no-leading-
    zeros hex used in certificate_versions.serial_number.
    ComputeRenewalWindow has 3 branches: bound RenewalPolicy →
    [notAfter - days, notAfter - days/2]; no policy → last 33% of
    validity; past expiry → [now, now + 1d] (renew immediately).
    Retry-After honors CERTCTL_ACME_SERVER_ARI_POLL_INTERVAL.

What ships:
  - internal/api/acme/{keychange,ari}.go (+ phase4_test.go: 15 tests).
  - internal/api/acme/order.go: RevokeCertRequest wire shape.
  - internal/api/handler/acme.go: KeyChange, RevokeCert, RenewalInfo
    + 11 new writeServiceError mappings.
  - internal/repository/postgres/acme.go: UpdateAccountJWKWithTx (FOR
    UPDATE + expectedOldThumbprint precondition; ErrACMEAccountKey-
    ConcurrentUpdate sentinel) + AccountOwnsCertificate.
  - internal/service/acme.go: RotateAccountKey + RevokeCert +
    RenewalInfo; CertificateRevoker + RenewalPolicyLookup interfaces;
    SetRevocationDelegate + SetRenewalPolicyLookup wiring; 11 new
    sentinels; 6 new metrics.
  - internal/service/acme_phase4_test.go: service-layer tests for
    RotateAccountKey (happy + duplicate-key) + RevokeCert (kid mismatch
    + jwk mismatch + jwk happy + already-revoked + reason-clamping) +
    RenewalInfo (disabled + bad cert-id).
  - internal/api/router/router.go: 6 new register calls (3 per-profile
    + 3 shorthand). Router parity exceptions extended in lockstep
    (in-tree SpecParityExceptions + CI-only openapi-handler-exceptions
    .yaml).
  - cmd/server/main.go: SetRevocationDelegate(revocationSvc) +
    SetRenewalPolicyLookup(renewalPolicyRepo) at startup.
  - internal/config/config.go: CERTCTL_ACME_SERVER_ARI_ENABLED (default
    true) + CERTCTL_ACME_SERVER_ARI_POLL_INTERVAL (default 6h);
    BuildDirectory's ariEnabled flag now flips on under
    cfg.ARIEnabled.
  - docs/acme-server.md: phase status flipped to Phase 4; endpoints
    table grows 6 rows (3 per-profile + 3 shorthand); FAQ section
    appended explaining how to rotate keys, revoke certs, and consume
    ARI.

Tests:
  - 'go vet ./...' clean across the repo.
  - 'go test -short -count=1 ./...' green across every package.
  - phase4_test.go covers: keychange happy-path + 5 negatives +
    MapKeyChangeErrorToProblem coverage; ARI cert-id round-trip + 6
    malformed cases + BuildARICertID from a generated cert; window-
    math 3 branches.
  - service-layer tests confirm: RotateAccountKey atomically swaps the
    thumbprint (verifies persisted state) and rejects duplicate keys;
    RevokeCert routes through the stub RevocationSvc with the right
    actor string + reason on the jwk path, rejects mismatched keys,
    rejects already-revoked certs, clamps reason codes correctly;
    RenewalInfo respects ARIEnabled + cert-id format.

Engineering history: cowork/WORKSPACE-CHANGELOG.md 'ACME-Server-4'.
This commit is contained in:
shankar0123
2026-05-03 16:51:06 +00:00
parent 62513ad12f
commit 4dc8d3fa5b
16 changed files with 2210 additions and 17 deletions
+99
View File
@@ -652,6 +652,105 @@ func jsonUnmarshalACME(data []byte, v interface{}) error {
return json.Unmarshal(data, v)
}
// --- Phase 4 — key rollover + revocation auth helper ------------------
// ErrACMEAccountKeyConcurrentUpdate is returned by UpdateAccountJWKWithTx
// when the row's stored thumbprint no longer matches expectedOldThumbprint.
// This is the "concurrent rollover" race signal: two requests both
// pre-validated their inner JWS against the SAME current key, but only
// one of them got into the WithinTx first. The loser's expectedOldThumb-
// print is now stale → the row's current thumbprint is the WINNER's NEW
// thumbprint. Caller maps to RFC 8555 §7.3.5 "key already exists".
var ErrACMEAccountKeyConcurrentUpdate = errors.New("acme: account key was rotated concurrently; re-fetch and retry")
// UpdateAccountJWKWithTx atomically swaps an account row's JWK
// (thumbprint + PEM) using a SELECT … FOR UPDATE guard plus an
// expectedOldThumbprint precondition.
//
// The row is locked for update; the precondition check then runs
// against the locked row inside the same tx. If a concurrent rollover
// committed first, this caller observes the new thumbprint and returns
// ErrACMEAccountKeyConcurrentUpdate (caller maps to 409). Postgres's
// row-level lock guarantees the two concurrent transactions serialize.
//
// Returns repository.ErrNotFound when the account row is missing.
func (r *ACMERepository) UpdateAccountJWKWithTx(
ctx context.Context,
q repository.Querier,
accountID string,
expectedOldThumbprint, newThumbprint, newJWKPEM string,
) error {
// SELECT … FOR UPDATE serializes concurrent rollovers on the same
// row. The lock is released when the surrounding tx commits / rolls
// back; the second concurrent caller blocks here until the first
// commits + then sees the new thumbprint on its read.
var current string
err := q.QueryRowContext(ctx, `
SELECT jwk_thumbprint
FROM acme_accounts
WHERE account_id = $1
FOR UPDATE
`, accountID).Scan(&current)
if err != nil {
if errors.Is(err, sql.ErrNoRows) {
return fmt.Errorf("account not found: %w", repository.ErrNotFound)
}
return fmt.Errorf("acme: lock account for key rollover: %w", err)
}
if current != expectedOldThumbprint {
return ErrACMEAccountKeyConcurrentUpdate
}
res, err := q.ExecContext(ctx, `
UPDATE acme_accounts
SET jwk_thumbprint = $2, jwk_pem = $3, updated_at = NOW()
WHERE account_id = $1 AND jwk_thumbprint = $4
`, accountID, newThumbprint, newJWKPEM, expectedOldThumbprint)
if err != nil {
// Postgres SQLSTATE 23505 = unique_violation on
// (profile_id, jwk_thumbprint). RFC 8555 §7.3.5 says the
// server SHOULD return 409 Conflict in that case.
var pqErr *pq.Error
if errors.As(err, &pqErr) && pqErr.Code == "23505" {
return ErrACMEAccountDuplicateThumbprint
}
return fmt.Errorf("acme: update account jwk: %w", err)
}
n, err := res.RowsAffected()
if err != nil {
return fmt.Errorf("acme: update account jwk rows affected: %w", err)
}
if n == 0 {
// SELECT … FOR UPDATE found the row but the UPDATE WHERE
// jwk_thumbprint=$4 did not. That means the thumbprint changed
// between the SELECT result + the UPDATE — should be impossible
// under the row lock. Treat defensively as a concurrency error.
return ErrACMEAccountKeyConcurrentUpdate
}
return nil
}
// AccountOwnsCertificate returns true when there is at least one
// acme_orders row with (account_id = accountID, certificate_id =
// certificateID). Used by ACMEService.RevokeCert (kid path) to confirm
// the requesting account had an order that produced the cert. The
// query is a single indexed lookup; we don't materialize the full row.
func (r *ACMERepository) AccountOwnsCertificate(ctx context.Context, accountID, certificateID string) (bool, error) {
if accountID == "" || certificateID == "" {
return false, nil
}
var count int
err := r.db.QueryRowContext(ctx, `
SELECT COUNT(1)
FROM acme_orders
WHERE account_id = $1 AND certificate_id = $2
`, accountID, certificateID).Scan(&count)
if err != nil {
return false, fmt.Errorf("acme: account-owns-certificate: %w", err)
}
return count > 0, nil
}
// scanACMEAccount is the shared shape for the SELECT-by-X account
// queries above. Returns sql.ErrNoRows-wrapped repository.ErrNotFound
// on miss; any other scan failure surfaces verbatim.