EST RFC 7030 hardening master bundle Phases 2-4: end-to-end mTLS sibling route + RFC 9266 channel binding + HTTP Basic enrollment-password + per-source-IP failed-auth limit + per-(CN, sourceIP) sliding-window cap.

Two new shared packages so EST + Intune share infrastructure:

- internal/cms/: RFC 9266 tls-exporter extractor (ExtractTLSExporter with stdlib-panic recovery for synthetic ConnectionStates) + CSR-side channel-binding parser via raw TBSCertificationRequestInfo walk (the stdlib's csr.Attributes can't represent the OCTET STRING binding value), VerifyChannelBinding composite, EmbedChannelBindingAttribute fixture helper, typed sentinel errors for missing / mismatch / not-TLS-1.3 mapped to HTTP 400 / 409 / 426 in handler.
- internal/trustanchor/: extracted from scep/intune/trust_anchor*.go so the EST mTLS sibling route + Intune dispatcher share the same SIGHUP-reloadable PEM bundle primitive. intune.TrustAnchorHolder is now `= trustanchor.Holder` (type alias) + NewTrustAnchorHolder = trustanchor.New (function alias), so every existing call site compiles unchanged. Intune's LoadTrustAnchor is a thin wrapper over trustanchor.LoadBundle. White-box tests moved to the new package.
- internal/ratelimit/: extracted from scep/intune/rate_limit.go (this was Phase 4.1, in the same bundle). intune.PerDeviceRateLimiter is now a thin wrapper preserving the (subject, issuer)→key composition; EST handler reaches for SlidingWindowLimiter directly.

ESTHandler grew six optional fields wired by per-profile setters (SetMTLSTrust / SetChannelBindingRequired / SetEnrollmentPassword / SetSourceIPRateLimiter / SetPerPrincipalRateLimiter / SetLabelForLog) plus four new mTLS-route methods (CACertsMTLS / SimpleEnrollMTLS / SimpleReEnrollMTLS / CSRAttrsMTLS); shared internal pipeline handleEnrollOrReEnroll(reEnroll, viaMTLS) keeps the auth/binding/rate-limit gates DRY.

New router method RegisterESTMTLSHandlers registers /.well-known/est-mtls/<PathID>/{cacerts,simpleenroll,simplereenroll,csrattrs}; AuthExemptDispatchPrefixes extends the no-auth chain to /.well-known/est-mtls.

cmd/server/main.go's EST loop wires per-profile mTLS holder + channel-binding policy + per-principal limiter + (when EnrollmentPassword non-empty) Basic + source-IP limiter; new preflightESTMTLSClientCATrustBundle returns *trustanchor.Holder so SIGHUP rotates the EST mTLS bundle live without restart.

SCEP + EST mTLS profiles now share a single union mtlsUnionPoolForTLS passed to buildServerTLSConfigWithMTLS (replaces the protocol-specific scepMTLSUnionPoolForTLS); per-handler re-verify enforces "cert must chain to THIS profile's bundle" so cross-protocol bleed is blocked at the application layer even though the TLS layer trusts certs from either pool's union.

Phase 3.3 source-IP failed-Basic limiter defaults: 10 attempts / 1h / 50k tracked IPs (no env var; tunable in a follow-up). Phase 4.2 per-principal limiter cap from CERTCTL_EST_PROFILE_<NAME>_RATE_LIMIT_PER_PRINCIPAL_24H (existing field, shipped in Phase 1).

New tests:

- internal/cms/channelbinding_test.go: extractor + CSR-side parser + composite + TLS-1.3 round-trip end-to-end + EmbedChannelBindingAttribute round-trip
- internal/trustanchor/holder_test.go: parseBundlePEM white-box + LoadBundle + Holder Get/Pool/SetLabelForLog/Reload-happy/Reload-keeps-old-on-failure/Reload-keeps-old-on-expired/WatchSIGHUP-reloads-pool/WatchSIGHUP-stop-clean
- internal/api/handler/est_hardening_test.go: 16 named cases covering mTLS no-trust-pool 500 + no-cert 401 + cross-profile cert 401 + happy-path 200 + CACertsMTLS auth gate + CSRAttrsMTLS auth gate + channel-binding required-absent-rejected + not-required-absent-allowed + writeChannelBindingError mapping + Basic no-header 401 + Basic wrong-password 401 + Basic correct-200 + Basic-no-password no-gate + per-IP failed-attempt lockout 429 + per-principal blocks-after-cap + different-principals-independent + no-limiter-unbounded.

Pre-commit verification (sandbox): gofmt clean, go vet clean (excluding repository/postgres, which the sandbox can't build due to the disk-space testcontainers download), staticcheck clean for cms/trustanchor/api/handler/api/router/scep/intune/ratelimit/cmd/server, go test -short -count=1 green for cms/trustanchor/api/handler/api/router/scep/intune/ratelimit/service. G-3 docs-drift guard reproduced locally clean (Phase 1 already documented every new env var; Phases 2-4 added zero new env vars).
2026-06-13 18:08:57 +00:00 · 2026-04-29 23:15:35 +00:00
parent 8cc1153bd9
commit aa139ee0d9
17 changed files with 3273 additions and 728 deletions
@@ -1,193 +1,87 @@
 package intune

 import (
-	"errors"
-	"sync"
 	"time"
+
+	"github.com/shankar0123/certctl/internal/ratelimit"
 )

 // SCEP RFC 8894 + Intune master bundle Phase 8.6.
 //
-// PerDeviceRateLimiter is the second line of defense behind the replay cache
-// from Phase 7. The replay cache catches the same challenge being submitted
-// twice (within the challenge TTL); this rate limiter catches a compromised
-// Connector signing key (or a stolen key+cert pair) issuing many DIFFERENT
-// valid challenges for the same device subject in a short window.
+// PerDeviceRateLimiter is the second line of defense behind the replay
+// cache from Phase 7. The replay cache catches the same challenge being
+// submitted twice (within the challenge TTL); this rate limiter catches a
+// compromised Connector signing key (or a stolen key+cert pair) issuing
+// many DIFFERENT valid challenges for the same device subject in a short
+// window.
 //
 // Threat model:
 //
 //   - Replay cache (Phase 7): nonce-keyed; catches duplicate submission.
 //   - This limiter: (Subject, Issuer)-keyed; catches enrollment-flooding.
 //
-// Default: 3 enrollments per (device GUID, Connector identity) per 24h.
+// EST RFC 7030 hardening master bundle Phase 4.1: the implementation that
+// used to live in this file was extracted to internal/ratelimit (where it
+// can be shared with EST per-principal + EST HTTP-Basic source-IP rate
+// limiters). PerDeviceRateLimiter is now a thin wrapper around
+// ratelimit.SlidingWindowLimiter that preserves the original
+// (subject, issuer) → key composition in the Allow signature so existing
+// SCEP/Intune callers don't have to change.
 //
-// Sizing: 100,000 distinct device entries (matches the replay cache cap).
-// At-cap: oldest entry evicted (small janitor pass) to avoid unbounded
-// memory growth on a fleet that grows past the cap.
-//
-// Why a hand-rolled token bucket instead of pulling in golang.org/x/time/rate:
-// the rate package is in go.sum as an indirect transitive but NOT a direct
-// dep. Adding it would create a new direct dep relationship for ~30 LoC of
-// state machine. The hand-rolled version below uses only stdlib (sync.Mutex
-// + time.Time arithmetic) and is small enough to fit on one screen.
-//
-// Algorithm: each (Subject, Issuer) key maps to a bucket holding a window's
-// worth of recent enrollment timestamps. On Allow, the bucket prunes
-// timestamps older than (now - window) and either appends the current
-// timestamp + returns true, or rejects + returns false when the post-prune
-// count is already at the cap. This is the "sliding window log" rate
-// limiter — exact (no token-leak rounding); O(N_per_key) per-call but N is
-// bounded by the cap (3 by default), so effectively O(1).
+// New callers SHOULD use ratelimit.SlidingWindowLimiter directly. The
+// EST RFC 7030 Phase 4.2 per-principal cap uses the shared package.

-// ErrRateLimited is the typed error returned when the per-device rate limit
-// fires. The handler maps this to a CertRep FAILURE with badRequest failInfo
-// + the `rate_limited` metric label.
-var ErrRateLimited = errors.New("intune: per-device rate limit exceeded for this (subject, issuer) within the configured window")
+// ErrRateLimited is the typed error returned when the per-device rate
+// limit fires. Aliased to ratelimit.ErrRateLimited so errors.Is matches
+// against either name (the SCEP audit closure already pinned the
+// "rate_limited" metric label against this sentinel; the alias preserves
+// sentinel identity across the package boundary).
+var ErrRateLimited = ratelimit.ErrRateLimited

-// PerDeviceRateLimiter is a sliding-window-log rate limiter keyed by
-// (Subject, Issuer) tuples derived from a parsed challenge claim.
-//
-// Concurrency: the limiter is safe for concurrent Allow calls. The internal
-// map is guarded by a mutex; the per-key slices are mutated only while the
-// mutex is held.
+// PerDeviceRateLimiter wraps ratelimit.SlidingWindowLimiter with the
+// (subject, issuer)-composed-key Allow signature the Intune dispatcher
+// uses. Concurrency-safe (the underlying limiter holds the mutex).
 type PerDeviceRateLimiter struct {
-	mu       sync.Mutex
-	buckets  map[string][]time.Time // key → sliding window of timestamps
-	maxN     int                    // max enrollments per window
-	window   time.Duration          // window length (default 24h)
-	cap      int                    // max keys before LRU eviction kicks in
-	disabled bool                   // maxN == 0 → all Allow calls return nil
+	inner *ratelimit.SlidingWindowLimiter
 }

 // NewPerDeviceRateLimiter returns a limiter with the given per-key cap +
-// window. maxN ≤ 0 disables the limiter (all Allow calls return nil); this
-// is operator opt-out for the rare case where the per-device cap is
+// window. maxN ≤ 0 disables the limiter (all Allow calls return nil);
+// this is operator opt-out for the rare case where the per-device cap is
 // undesirable (e.g. test harnesses, sketchpad deploys).
 //
 // Window defaults to 24h when zero. Map cap defaults to 100,000 when zero
 // (matches the replay cache cap; see internal/scep/intune/replay.go).
 func NewPerDeviceRateLimiter(maxN int, window time.Duration, mapCap int) *PerDeviceRateLimiter {
-	if window <= 0 {
-		window = 24 * time.Hour
-	}
-	if mapCap <= 0 {
-		mapCap = 100_000
-	}
-	return &PerDeviceRateLimiter{
-		buckets:  make(map[string][]time.Time),
-		maxN:     maxN,
-		window:   window,
-		cap:      mapCap,
-		disabled: maxN <= 0,
-	}
+	return &PerDeviceRateLimiter{inner: ratelimit.NewSlidingWindowLimiter(maxN, window, mapCap)}
 }

-// Allow checks whether an enrollment for the given (subject, issuer) tuple
-// is permitted right now. Returns nil when allowed (and records the timestamp
-// in the bucket) or ErrRateLimited when the bucket is at maxN.
+// Allow checks whether an enrollment for the given (subject, issuer)
+// tuple is permitted right now. Returns nil when allowed (and records
+// the timestamp in the bucket) or ErrRateLimited when the bucket is at
+// maxN.
 //
 // Empty subject is treated as "skip the limiter" — the caller's claim
-// validation should have rejected an empty-subject claim already; this is
-// belt-and-suspenders to prevent a single empty-subject bucket from
-// becoming a fleet-wide chokepoint. The Connector emits non-empty subject
-// (device GUID) on every legitimate challenge.
+// validation should have rejected an empty-subject claim already; this
+// is belt-and-suspenders to prevent a single empty-subject bucket from
+// becoming a fleet-wide chokepoint.
 func (l *PerDeviceRateLimiter) Allow(subject, issuer string, now time.Time) error {
-	if l.disabled {
-		return nil
-	}
 	if subject == "" {
-		// Caller's claim validation should reject empty-subject upstream;
-		// this short-circuit is defense-in-depth so a misconfigured
-		// Connector can't DoS us via the rate-limit path.
+		// Empty-subject early return preserved from the pre-Phase-4.1
+		// behavior: ratelimit.SlidingWindowLimiter also short-circuits
+		// on empty key, but the explicit check here documents the
+		// (subject, issuer) → empty-key contract and saves one call
+		// frame in the hot path.
 		return nil
 	}
 	key := subject + "|" + issuer
-
-	l.mu.Lock()
-	defer l.mu.Unlock()
-
-	// At-cap eviction: when the map is full, drop the oldest entry by
-	// finding the bucket whose newest timestamp is the smallest. O(N) but
-	// rarely fires; the prune-on-Allow path keeps most buckets short-lived.
-	if len(l.buckets) >= l.cap {
-		l.evictOldestLocked(now)
-	}
-
-	bucket := l.buckets[key]
-	bucket = pruneOlderThan(bucket, now.Add(-l.window))
-
-	if len(bucket) >= l.maxN {
-		// Don't append; over the limit. Persist the pruned bucket so the
-		// next call sees the most-recently-pruned state.
-		l.buckets[key] = bucket
-		return ErrRateLimited
-	}
-
-	bucket = append(bucket, now)
-	l.buckets[key] = bucket
-	return nil
-}
-
-// pruneOlderThan returns the slice with all entries strictly before
-// `cutoff` removed. Preserves order (timestamps are appended in increasing
-// time, so a single linear scan from the front suffices).
-func pruneOlderThan(b []time.Time, cutoff time.Time) []time.Time {
-	i := 0
-	for i < len(b) && b[i].Before(cutoff) {
-		i++
-	}
-	if i == 0 {
-		return b
-	}
-	// Copy-shrink to release the underlying-array memory eventually
-	// (otherwise the slice would hold a reference to the older entries
-	// indefinitely until a re-allocation).
-	out := make([]time.Time, len(b)-i)
-	copy(out, b[i:])
-	return out
-}
-
-// evictOldestLocked drops the map entry whose newest timestamp is the
-// oldest. Called under l.mu. O(N_keys) per eviction; at-cap is rare in
-// practice (caps are sized for fleet steady-state).
-func (l *PerDeviceRateLimiter) evictOldestLocked(now time.Time) {
-	var (
-		oldestKey string
-		oldestTs  time.Time
-		first     = true
-	)
-	for k, b := range l.buckets {
-		if len(b) == 0 {
-			// Empty bucket — drop it immediately, no candidate scan needed.
-			delete(l.buckets, k)
-			return
-		}
-		newest := b[len(b)-1]
-		if first || newest.Before(oldestTs) {
-			oldestKey = k
-			oldestTs = newest
-			first = false
-		}
-	}
-	if oldestKey != "" {
-		delete(l.buckets, oldestKey)
-	}
-	// Suppress unused-parameter warning for `now` in case the eviction
-	// strategy changes (e.g. swap to LRU keyed by time of last Allow).
-	_ = now
+	return l.inner.Allow(key, now)
 }

 // Len returns the approximate number of distinct (subject, issuer) keys
-// currently tracked. For observability + tests; not load-stable under
-// concurrent Allow calls.
-func (l *PerDeviceRateLimiter) Len() int {
-	l.mu.Lock()
-	defer l.mu.Unlock()
-	return len(l.buckets)
-}
+// currently tracked. For observability + tests.
+func (l *PerDeviceRateLimiter) Len() int { return l.inner.Len() }

 // Disabled reports whether the limiter is in opt-out mode (maxN ≤ 0).
 // Useful for handler-side gating + admin-endpoint observability.
-func (l *PerDeviceRateLimiter) Disabled() bool {
-	return l.disabled
-}
+func (l *PerDeviceRateLimiter) Disabled() bool { return l.inner.Disabled() }