certctl/internal/scep/intune/replay.go
Shankar 0861aa9482 feat(scep-intune): parser + validator for Microsoft Intune Connector challenge format
Phase 7 of the SCEP RFC 8894 + Intune master bundle. Adds the
internal/scep/intune package that validates Microsoft Intune Certificate
Connector signed challenges embedded in SCEP CSR challengePassword
attributes. This is the parsing/validation foundation; Phase 8 wires it
into the SCEP service dispatcher.

What's included:

  * doc.go — package architecture (Intune cloud → Connector → certctl
    SCEP server) + 'what this package is NOT' guard rails. We do NOT
    implement full JOSE: no JKU / kid / x5c trust, no JWKS fetch.
    Trust anchor is operator-supplied at startup and pinned. The
    package does NOT call Microsoft's API directly — the Connector
    already did that; we validate its signed attestation.

  * trust_anchor.go — LoadTrustAnchor(path) reads a PEM bundle of
    Intune Connector signing certs. Skips non-CERTIFICATE PEM blocks
    (operators sometimes paste chains with the priv key by mistake).
    Rejects empty bundles + expired certs at startup with an
    operator-actionable message including the cert subject. SIGHUP
    reload lands in Phase 8.5; today it's load-once-at-boot.

  * claim.go — ChallengeClaim struct + DeviceMatchesCSR helper.
    Set-equality semantics for SAN-DNS/SAN-RFC822/SAN-UPN: the CSR
    must carry EXACTLY the claim's elements, no extras and no missing.
    Empty claim slice = no constraint on that dimension.
    Per-dimension typed errors (ErrClaimCNMismatch /
    ErrClaimSANDNSMismatch / ErrClaimSANRFC822Mismatch /
    ErrClaimSANUPNMismatch) so audit logs surface the failure
    dimension without string-matching. extractUPNSans is stubbed to
    return nil with documented fail-closed behavior — non-empty UPN
    claims fail the equalSets check (correct behavior; the rare deploy
    that pins UPN SANs hot-fixes the ASN.1 walker per the inline
    comment).
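The set-equality rule for DNS SANs can be illustrated with a short sketch. The helper names `normalise` and `claimMatches` are hypothetical and are not the package's own normaliseSet / equalSets; the sketch assumes case-insensitive comparison and de-duplication, as the claim tests describe for DNS names.

```go
package main

import (
	"fmt"
	"strings"
)

// normalise lower-cases, trims, and de-duplicates the values.
func normalise(values []string) map[string]struct{} {
	set := make(map[string]struct{}, len(values))
	for _, v := range values {
		v = strings.ToLower(strings.TrimSpace(v))
		if v != "" {
			set[v] = struct{}{}
		}
	}
	return set
}

// claimMatches reports whether the CSR values satisfy the claim:
// an empty claim imposes no constraint, otherwise the two sets must
// be exactly equal (no extras, none missing).
func claimMatches(claim, csr []string) bool {
	if len(claim) == 0 {
		return true
	}
	want, got := normalise(claim), normalise(csr)
	if len(want) != len(got) {
		return false
	}
	for v := range want {
		if _, ok := got[v]; !ok {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(claimMatches([]string{"a.example.com"}, []string{"A.Example.Com"}))                // true
	fmt.Println(claimMatches([]string{"a.example.com"}, []string{"a.example.com", "b.example.com"})) // false
	fmt.Println(claimMatches(nil, []string{"anything.example.com"}))                               // true
}
```

Under this rule a stubbed extractor that returns nil makes any non-empty claim fail the length check, which is the fail-closed behavior described for UPN SANs.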

  * replay.go — ReplayCache: bounded in-memory cache of seen nonces
    with TTL. Sized for 100,000 entries (60-min Connector validity ×
    25 RPS Intune fleet steady-state ≈ 90,000 challenges/hour with
    headroom). sync.Map for concurrent read/write; janitor goroutine
    wakes every TTL/4 to evict expired entries; at-cap O(N)
    oldest-eviction (rarely fires; janitor keeps the cache below
    cap). Redis-backed variant deferred to V3-Pro.

  * challenge.go — the load-bearing piece:

    - ParseChallenge(raw) splits the JWT-like compact serialization
      into header/payload/signature and base64url-decodes each.
      Tolerates both padded + unpadded encodings (some Connector
      builds emit padded; RFC 7515 §2 says unpadded; we accept both).
      Validates the header parses as JSON before returning so the
      malformed-signal lands earlier in the pipeline.

    - ValidateChallenge(raw, trust, expectedAudience, now):
        1. ParseChallenge
        2. JWS signature verify over (segment0 || '.' || segment1)
           — re-derived from the raw on-wire bytes, NOT
           re-base64-encoded, per RFC 7515 §3.1 (re-encoding could
           produce a byte-different input than what was signed)
        3. Signature alg dispatch:
             RS256: rsa.VerifyPKCS1v15(SHA-256)
             ES256: tries fixed-width r||s (JOSE-canonical) first,
                    falls back to ASN.1 DER (older Connectors)
             alg=none: explicit reject with audit-log-friendly
                       message (RFC 7518 §3.6 attack vector)
             HS*/PS*: rejected as 'unsupported alg' (no shared
                      secret in our threat model)
        4. Version-detection prelude (versionedChallenge struct +
           versionUnmarshalers map). Today's format is v1 (no
           explicit version field; absence IS the v1 signal). Adding
           v2 = adding a parser + a registration line; v1 path stays
           untouched. Defends against the inevitable Microsoft format
           change at ~30 LoC + 2 tests cost vs. a P0 incident.
        5. Time bounds (iat / exp); audience pin (skipped when
           expectedAudience == "").

      Replay protection is the CALLER's job (handler glues parser +
      cache; validator stays stateless + testable).

  * Typed errors: ErrChallengeMalformed / ErrChallengeSignature /
    ErrChallengeExpired / ErrChallengeNotYetValid /
    ErrChallengeWrongAudience / ErrChallengeReplay /
    ErrChallengeUnknownVersion. errors.Is-friendly so the handler
    can audit failure dimension.

Tests (94.8% coverage):

  * challenge_test.go (18 tests): happy-path RS256 + ES256
    fixed-width + ES256 DER; TamperedSignature; TamperedPayload;
    Expired; NotYetValid; WrongAudience; EmptyExpectedAudience
    disables check; RotatedTrustAnchor; EmptyTrustBundle;
    AlgNoneRejected; UnsupportedAlg (HS256); MissingAlg;
    VersionV1ExplicitOK; VersionUnknownRejected;
    MixedTrustBundle iter (skip key-type mismatches without
    surfacing as Signature err); NonJSONPayloadButValidSignature;
    Malformed cases (empty, missing dots, bad base64, non-JSON
    header — 9 sub-cases); PaddedBase64Tolerated.

  * claim_test.go (13 tests): per-dimension matching across CN +
    SAN-DNS + SAN-RFC822 + SAN-UPN; nil guards; case-insensitive DNS
    (RFC 4343); dedupe set-equality; empty claim = no constraint;
    UPN stub canary; normaliseSet edge cases; equalSets length
    mismatch.

  * replay_test.go (11 tests): first-fresh; duplicate-rejected;
    past-TTL-fresh; Sweep-evicts-expired; empty-nonce
    short-circuits; at-cap LRU eviction; default-cap=100k;
    Close-idempotent; TTL=0 disables janitor; concurrent-race-free
    (50 goroutines × 200 inserts); empty-nonce twice is fresh both
    times (we don't cache empties).

  * trust_anchor_test.go: HappyPath single + multi cert; SkipsNonCertBlocks
    (priv key + cert mix); EmptyBundleRejected; OnlyKeyBlocksRejected;
    ExpiredCertRejected (with subject CN in error); MalformedCertRejected;
    LoadTrustAnchor disk + EmptyPath + MissingFile.

  * fuzz_test.go: FuzzParseChallenge with seed corpus covering both
    the well-formed and the obvious-malformed shapes. Survived 187k
    execs in 21s without panic on the local burst; CI runs 5 min.

Verification:

  * gofmt -l ./internal/scep/intune: clean
  * go vet ./internal/scep/intune/...: clean
  * staticcheck ./internal/scep/intune/...: clean
  * go test -count=1 -cover ./internal/scep/intune/...: 94.8%
    (target was ≥85%)
  * go vet ./internal/... ./cmd/...: clean (no rest-of-repo regressions)
  * No new CERTCTL_* env vars (those land in Phase 8 with the
    config gate); G-3 docs-drift CI guard not triggered.
  * No new HTTP routes; openapi-parity guard not triggered.

Phase 8 will:
  - Add SCEPProfileConfig.Intune* env vars + preflight gate
  - Wire the validator into the SCEP service dispatcher
    (Intune-shaped challenges → validator; static → existing path)
  - Trust-anchor SIGHUP reload mirroring cmd/server/tls.go::watchSIGHUP
  - Per-claim rate limit + audit metrics

Refs: cowork/scep-rfc8894-intune-master-prompt.md::Phase 7
      cowork/scep-rfc8894-intune/progress.md
2026-04-29 14:38:35 +00:00

192 lines
5.8 KiB
Go

package intune

import (
	"sync"
	"time"
)

// ReplayCache is a bounded in-memory cache of seen Intune challenge
// nonces with TTL. Gates against the same Connector-signed challenge
// being replayed against the SCEP server within its validity window.
//
// SCEP RFC 8894 + Intune master bundle Phase 7.4b.
//
// Sizing rationale (cap = 100,000 entries):
//
//   - Microsoft's published Connector defaults give each challenge
//     a 60-minute validity window. A high-volume Intune fleet
//     enrolling at ~25 RPS hits ~90,000 challenges/hour.
//   - Capping at 100,000 covers the steady-state load with headroom.
//     The janitor goroutine evicts expired entries every TTL/4. If an
//     insert still finds the cache at cap, CheckAndInsert evicts the
//     entry with the earliest expiry. Lookups never refresh an entry,
//     so this is oldest-first eviction rather than strict LRU. It
//     accepts a small replay-window risk over an OOM crash.
//   - Operators who push beyond this rate should flip to a Redis-
//     backed implementation (deferred to V3-Pro per the master
//     prompt's deferral list); the in-memory variant is V2 default.
//
// Concurrency: sync.Map handles concurrent read/write without an
// explicit lock; the janitor goroutine periodically walks for expired
// entries. Cap enforcement on Insert is done under a small mutex so
// the cap check + size update are atomic.
type ReplayCache struct {
	entries  sync.Map      // nonce → expiry (time.Time)
	mu       sync.Mutex    // guards size
	size     int           // approximate count (sync.Map has no Len)
	cap      int           // max entries before oldest-first eviction kicks in
	ttl      time.Duration // validity window applied to each inserted nonce
	stop     chan struct{} // closed by Close to stop the janitor
	stopOnce sync.Once     // makes Close idempotent
}

// NewReplayCache returns a ReplayCache with the given TTL + cap. Starts
// a janitor goroutine that wakes every TTL/4 to evict expired entries.
// Caller MUST call Close when done to stop the goroutine.
//
// TTL = 0 disables the janitor (useful for tests that drive expiry
// manually). Note that with TTL = 0 each entry's expiry equals its
// insertion time, so a duplicate presented at the same or a later
// time is treated as fresh.
// capHint <= 0 defaults to 100,000 (the rationale-documented
// production default).
func NewReplayCache(ttl time.Duration, capHint int) *ReplayCache {
	if capHint <= 0 {
		capHint = 100_000
	}
	c := &ReplayCache{
		cap:  capHint,
		ttl:  ttl,
		stop: make(chan struct{}),
	}
	if ttl > 0 {
		go c.janitor()
	}
	return c
}

// CheckAndInsert returns true when the nonce has NOT been seen before
// (i.e. the challenge is not a replay) AND records the nonce as seen
// with expiry = now + c.ttl. Returns false when the nonce was already
// seen and is still within its TTL window; the caller should treat
// this as a replay attack and reject the challenge.
//
// The lookup and the insert run under c.mu, so two concurrent calls
// carrying the same nonce cannot both observe it as unseen.
//
// At-cap behavior: when the cache is full, CheckAndInsert evicts the
// oldest entry (a single Range pass to find min-expiry) before
// inserting. This is O(N) at the boundary; in practice the janitor
// keeps the cache below cap so the eviction path rarely fires.
func (c *ReplayCache) CheckAndInsert(nonce string, now time.Time) bool {
	if nonce == "" {
		// Empty nonce can't be tracked meaningfully; treat as 'fresh'.
		// The caller's claim-validation should reject empty-nonce
		// challenges separately (it's a Connector-emitted-format bug).
		return true
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if existing, ok := c.entries.Load(nonce); ok {
		if existingExpiry, _ := existing.(time.Time); now.Before(existingExpiry) {
			return false // replay
		}
		// Past TTL: refresh the expiry in place and treat as fresh.
		// The key is already counted, so size stays unchanged.
		c.entries.Store(nonce, now.Add(c.ttl))
		return true
	}
	// At-cap eviction of the entry with the earliest expiry.
	if c.size >= c.cap {
		c.evictOldestLocked()
	}
	c.size++
	c.entries.Store(nonce, now.Add(c.ttl))
	return true
}

// Close stops the janitor goroutine. Safe to call multiple times.
func (c *ReplayCache) Close() {
	c.stopOnce.Do(func() {
		close(c.stop)
	})
}

// Sweep walks the entries and evicts any past TTL. Public so tests
// can drive expiry without waiting for the janitor's tick. Returns
// the number of entries evicted.
func (c *ReplayCache) Sweep(now time.Time) int {
	evicted := 0
	c.entries.Range(func(k, v any) bool {
		expiry, _ := v.(time.Time)
		if !now.Before(expiry) {
			c.delete(k.(string))
			evicted++
		}
		return true
	})
	return evicted
}

// delete is the size-tracked counterpart to entries.Delete. The size
// counter is approximate (sync.Map.Range races with Insert), but the
// approximation only affects cap enforcement timing and never causes
// a false replay rejection.
func (c *ReplayCache) delete(nonce string) {
	if _, loaded := c.entries.LoadAndDelete(nonce); loaded {
		c.mu.Lock()
		if c.size > 0 {
			c.size--
		}
		c.mu.Unlock()
	}
}

// evictOldestLocked must be called with c.mu held. Walks entries to
// find the one with the minimum expiry (i.e. the oldest entry, closest
// to its TTL deadline) and removes it. O(N) but rarely hit; the
// janitor keeps the cache below cap.
func (c *ReplayCache) evictOldestLocked() {
	var oldestKey string
	var oldestExpiry time.Time
	first := true
	c.entries.Range(func(k, v any) bool {
		expiry, _ := v.(time.Time)
		if first || expiry.Before(oldestExpiry) {
			oldestKey = k.(string)
			oldestExpiry = expiry
			first = false
		}
		return true
	})
	if oldestKey != "" {
		if _, loaded := c.entries.LoadAndDelete(oldestKey); loaded && c.size > 0 {
			c.size--
		}
	}
}

// janitor wakes every ttl/4 and sweeps expired entries. Background-only;
// the test harness can drive expiry deterministically via Sweep.
func (c *ReplayCache) janitor() {
	interval := c.ttl / 4
	if interval <= 0 {
		interval = 1 * time.Minute
	}
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-c.stop:
			return
		case <-t.C:
			c.Sweep(time.Now())
		}
	}
}

// Len returns the approximate cache size for observability. Not
// load-stable; use only for metrics + debug logs.
func (c *ReplayCache) Len() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.size
}