EST RFC 7030 hardening master bundle Phases 2-4: end-to-end mTLS sibling
route + RFC 9266 channel binding + HTTP Basic enrollment-password +
per-source-IP failed-auth limit + per-(CN, sourceIP) sliding-window cap.

Two new shared packages so EST + Intune share infrastructure:

- internal/cms/: RFC 9266 tls-exporter extractor (ExtractTLSExporter with
  stdlib-panic recovery for synthetic ConnectionStates) + CSR-side
  channel-binding parser via raw TBSCertificationRequestInfo walk (the
  stdlib's csr.Attributes can't represent the OCTET STRING binding value),
  VerifyChannelBinding composite, EmbedChannelBindingAttribute fixture
  helper, typed sentinel errors for missing / mismatch / not-TLS-1.3
  mapped to HTTP 400 / 409 / 426 in handler.
- internal/trustanchor/: extracted from scep/intune/trust_anchor*.go so the
  EST mTLS sibling route + Intune dispatcher share the same
  SIGHUP-reloadable PEM bundle primitive. intune.TrustAnchorHolder is now
  `= trustanchor.Holder` (type alias) + NewTrustAnchorHolder =
  trustanchor.New (function alias); every existing call site compiles
  unchanged. Intune's LoadTrustAnchor is a thin wrapper over
  trustanchor.LoadBundle. White-box tests moved to the new package.
- internal/ratelimit/: extracted from scep/intune/rate_limit.go (this was
  Phase 4.1, in the same bundle). intune.PerDeviceRateLimiter is now a thin
  wrapper preserving the (subject, issuer)→key composition; EST handler
  reaches for SlidingWindowLimiter directly.

ESTHandler grew six optional fields wired by per-profile setters
(SetMTLSTrust / SetChannelBindingRequired / SetEnrollmentPassword /
SetSourceIPRateLimiter / SetPerPrincipalRateLimiter / SetLabelForLog) plus
four new mTLS-route methods (CACertsMTLS / SimpleEnrollMTLS /
SimpleReEnrollMTLS / CSRAttrsMTLS); shared internal pipeline
handleEnrollOrReEnroll(reEnroll, viaMTLS) keeps the auth/binding/rate-limit
gates DRY.

New router method RegisterESTMTLSHandlers registers
/.well-known/est-mtls/<PathID>/{cacerts,simpleenroll,simplereenroll,csrattrs};
AuthExemptDispatchPrefixes extends the no-auth chain to
/.well-known/est-mtls. cmd/server/main.go's EST loop wires per-profile mTLS
holder + channel-binding policy + per-principal limiter + (when
EnrollmentPassword non-empty) Basic + source-IP limiter; new
preflightESTMTLSClientCATrustBundle returns *trustanchor.Holder so SIGHUP
rotates the EST mTLS bundle live without restart.

SCEP + EST mTLS profiles now share a single union mtlsUnionPoolForTLS passed
to buildServerTLSConfigWithMTLS (replaces the protocol-specific
scepMTLSUnionPoolForTLS); per-handler re-verify enforces "cert must chain to
THIS profile's bundle" so cross-protocol bleed is blocked at the application
layer even though the TLS layer trusts certs from either pool's union.

Phase 3.3 source-IP failed-Basic limiter defaults: 10 attempts / 1h / 50k
tracked IPs (no env var; tunable in a follow-up). Phase 4.2 per-principal
limiter cap from CERTCTL_EST_PROFILE_<NAME>_RATE_LIMIT_PER_PRINCIPAL_24H
(existing field, Phase 1 shipped).

New tests:

- internal/cms/channelbinding_test.go: extractor + CSR-side parser +
  composite + TLS-1.3 round-trip end-to-end + EmbedChannelBindingAttribute
  round-trip
- internal/trustanchor/holder_test.go: parseBundlePEM white-box + LoadBundle
  + Holder
  Get/Pool/SetLabelForLog/Reload-happy/Reload-keeps-old-on-failure/Reload-keeps-old-on-expired/WatchSIGHUP-reloads-pool/WatchSIGHUP-stop-clean
- internal/api/handler/est_hardening_test.go: 16 named cases covering mTLS
  no-trust-pool 500 + no-cert 401 + cross-profile cert 401 + happy-path 200
  + CACertsMTLS auth gate + CSRAttrsMTLS auth gate + channel-binding
  required-absent-rejected + not-required-absent-allowed +
  writeChannelBindingError mapping + Basic no-header 401 + Basic
  wrong-password 401 + Basic correct-200 + Basic-no-password no-gate +
  per-IP failed-attempt lockout 429 + per-principal blocks-after-cap +
  different-principals-independent + no-limiter-unbounded.

Pre-commit verification (sandbox): gofmt clean, go vet clean (excluding
repository/postgres which the sandbox can't build: disk-space
testcontainers download), staticcheck clean for
cms/trustanchor/api/handler/api/router/scep/intune/ratelimit/cmd/server,
go test -short -count=1 green for
cms/trustanchor/api/handler/api/router/scep/intune/ratelimit/service.
G-3 docs-drift guard reproduced locally clean (Phase 1 already documented
every new env var; Phases 2-4 added zero new env vars).
2026-06-10 00:28:58 +00:00 · 2026-04-29 23:15:35 +00:00
parent 8cc1153bd9
commit aa139ee0d9
17 changed files with 3273 additions and 728 deletions
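
The alias approach mentioned in the commit message (intune.TrustAnchorHolder as a type alias, NewTrustAnchorHolder bound to trustanchor.New) is what lets old call sites compile unchanged. A single-file sketch of the pattern, under the assumption that the function alias is a package-level variable; Holder's field and both caller functions are hypothetical stand-ins:

```go
package main

import "fmt"

// Holder stands in for trustanchor.Holder; its field is hypothetical.
type Holder struct{ label string }

// New stands in for trustanchor.New.
func New(label string) *Holder { return &Holder{label: label} }

// What would remain in the legacy package: an alias and a bound constructor.
// The alias names the same type, so no conversions are needed anywhere.
type TrustAnchorHolder = Holder

var NewTrustAnchorHolder = New

// legacyCaller was written against the old names and needs no edits.
func legacyCaller() *TrustAnchorHolder { return NewTrustAnchorHolder("est-mtls") }

// newCaller takes the new type; values built through the alias pass directly.
func newCaller(h *Holder) string { return h.label }

func main() {
	fmt.Println(newCaller(legacyCaller()))
}
```

A defined type (`type TrustAnchorHolder Holder`, without the `=`) would not work here: it would create a distinct type and force a conversion at every boundary between old and new callers.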
@@ -0,0 +1,188 @@
+// Package ratelimit provides shared rate-limit primitives used by
+// authenticated-but-shared-credential code paths (SCEP/Intune
+// per-device challenge enrollment, EST per-principal CSR enrollment,
+// EST HTTP-Basic source-IP failed-auth limiter) where the threat
+// model is "single legitimate identity could mint enrollments
+// faster than any human/fleet workflow would."
+//
+// Origin: this package was extracted from
+// internal/scep/intune/rate_limit.go in the EST RFC 7030 hardening
+// master bundle Phase 4.1. Callers are the Intune dispatcher
+// (per-device-GUID cap on enrollment), the EST per-principal cap
+// (Phase 4.2), and the EST source-IP failed-auth limiter. The
+// original Intune-package type + constructor + ErrRateLimited
+// sentinel remain at internal/scep/intune/rate_limit.go as a thin
+// wrapper so existing call sites compile unchanged. New callers
+// SHOULD use this package directly.
+//
+// Algorithm: sliding window log. Each key maps to a bucket holding
+// timestamps within the configured window. On Allow, the bucket
+// prunes timestamps older than (now - window) and either appends +
+// returns nil, or rejects + returns ErrRateLimited when the
+// post-prune count is already at maxN. Exact (no token-leak
+// rounding); O(N_per_key) per call, but N is bounded by maxN, so
+// effectively O(1) outside at-cap eviction.
+//
+// Concurrency: safe for concurrent Allow calls. Internal map guarded
+// by sync.Mutex; per-key slices mutated only while the mutex is
+// held.
+//
+// Memory: bounded by the per-instance map cap (default 100,000 keys;
+// configurable). At-cap eviction drops the key whose most recent
+// timestamp is the oldest. The scan is O(N_keys) and only runs
+// while the map is at capacity.
+package ratelimit
+
+import (
+	"errors"
+	"sync"
+	"time"
+)
+
+// ErrRateLimited is returned by SlidingWindowLimiter.Allow when the
+// bucket for the given key is already at the cap. Callers can
+// errors.Is against this sentinel; the underlying message is stable
+// across the package's lifetime so test assertions can match on it.
+var ErrRateLimited = errors.New("ratelimit: per-key cap exceeded for the configured window")
+
+// SlidingWindowLimiter is the sliding-window-log rate limiter.
+//
+// Construct via NewSlidingWindowLimiter. The zero value is NOT
+// usable — the buckets map needs initialisation.
+type SlidingWindowLimiter struct {
+	mu       sync.Mutex
+	buckets  map[string][]time.Time // key → sliding window of timestamps
+	maxN     int                    // max events per key per window
+	window   time.Duration          // window length (default 24h)
+	cap      int                    // max tracked keys before at-cap eviction kicks in
+	disabled bool                   // maxN <= 0 → all Allow calls return nil
+}
+
+// NewSlidingWindowLimiter returns a limiter with the given per-key
+// cap + window. maxN <= 0 disables the limiter (all Allow calls
+// return nil); this is operator opt-out for the rare case where the
+// per-key cap is undesirable (test harnesses, sketchpad deploys).
+//
+// Window defaults to 24h when zero. Map cap defaults to 100,000 when
+// zero (matches the SCEP/Intune replay cache cap).
+func NewSlidingWindowLimiter(maxN int, window time.Duration, mapCap int) *SlidingWindowLimiter {
+	if window <= 0 {
+		window = 24 * time.Hour
+	}
+	if mapCap <= 0 {
+		mapCap = 100_000
+	}
+	return &SlidingWindowLimiter{
+		buckets:  make(map[string][]time.Time),
+		maxN:     maxN,
+		window:   window,
+		cap:      mapCap,
+		disabled: maxN <= 0,
+	}
+}
+
+// Allow reports whether an event keyed by `key` is permitted right
+// now. Returns nil when allowed (and records the timestamp in the
+// bucket) or ErrRateLimited when the bucket is at maxN.
+//
+// Empty key is treated as "skip the limiter" — the caller's
+// validation should have rejected an empty-key event already; this
+// is belt-and-suspenders so a single empty-key bucket doesn't
+// become a chokepoint for every empty-key event. SCEP/Intune
+// callers compose the key as `subject + "|" + issuer`; EST callers
+// compose `cn + "|" + sourceIP` or `sourceIP`-alone for the
+// failed-auth limiter.
+func (l *SlidingWindowLimiter) Allow(key string, now time.Time) error {
+	if l.disabled {
+		return nil
+	}
+	if key == "" {
+		return nil
+	}
+
+	l.mu.Lock()
+	defer l.mu.Unlock()
+
+	// At-cap eviction: when the map is full and `key` is new, drop the
+	// entry whose newest timestamp is the smallest. Existing keys skip
+	// this step so a throttled key cannot be evicted (and reset) by its
+	// own retry. O(N_keys) per eviction.
+	if _, tracked := l.buckets[key]; !tracked && len(l.buckets) >= l.cap {
+		l.evictOldestLocked()
+	}
+
+	bucket := l.buckets[key]
+	bucket = pruneOlderThan(bucket, now.Add(-l.window))
+
+	if len(bucket) >= l.maxN {
+		// Don't append; over the limit. Persist the pruned bucket so
+		// the next call sees the most-recently-pruned state.
+		l.buckets[key] = bucket
+		return ErrRateLimited
+	}
+
+	bucket = append(bucket, now)
+	l.buckets[key] = bucket
+	return nil
+}
+
+// pruneOlderThan returns the slice with all entries strictly before
+// `cutoff` removed. Preserves order (timestamps are appended in
+// increasing time, so a single linear scan from the front suffices).
+func pruneOlderThan(b []time.Time, cutoff time.Time) []time.Time {
+	i := 0
+	for i < len(b) && b[i].Before(cutoff) {
+		i++
+	}
+	if i == 0 {
+		return b
+	}
+	// Copy-shrink to release the underlying-array memory eventually
+	// (otherwise the slice would hold a reference to the older
+	// entries indefinitely until a re-allocation).
+	out := make([]time.Time, len(b)-i)
+	copy(out, b[i:])
+	return out
+}
+
+// evictOldestLocked drops the map entry whose newest timestamp is
+// the oldest. Called under l.mu. O(N_keys) per eviction; at-cap is
+// rare in practice (caps are sized for steady-state).
+func (l *SlidingWindowLimiter) evictOldestLocked() {
+	var (
+		oldestKey string
+		oldestTs  time.Time
+		first     = true
+	)
+	for k, b := range l.buckets {
+		if len(b) == 0 {
+			// Empty bucket — drop it immediately, no candidate scan needed.
+			delete(l.buckets, k)
+			return
+		}
+		newest := b[len(b)-1]
+		if first || newest.Before(oldestTs) {
+			oldestKey = k
+			oldestTs = newest
+			first = false
+		}
+	}
+	if oldestKey != "" {
+		delete(l.buckets, oldestKey)
+	}
+}
+
+// Len returns the approximate number of distinct keys currently
+// tracked. For observability + tests; not load-stable under
+// concurrent Allow calls.
+func (l *SlidingWindowLimiter) Len() int {
+	l.mu.Lock()
+	defer l.mu.Unlock()
+	return len(l.buckets)
+}
+
+// Disabled reports whether the limiter is in opt-out mode (maxN <= 0).
+// Useful for handler-side gating + admin-endpoint observability.
+func (l *SlidingWindowLimiter) Disabled() bool {
+	return l.disabled
+}
@@ -0,0 +1,197 @@
+package ratelimit
+
+import (
+	"errors"
+	"fmt"
+	"sync"
+	"testing"
+	"time"
+)
+
+// EST RFC 7030 hardening master bundle Phase 4.1: white-box tests for the
+// SlidingWindowLimiter primitives that used to live in
+// internal/scep/intune/rate_limit_test.go
+// (TestPerDeviceRateLimiter_DefaultCapsHonored, TestPruneOlderThan,
+// TestPruneOlderThan_NoOpWhenNothingToPrune). The behavioral coverage in
+// intune/rate_limit_test.go stays: it exercises the wrapper's (subject,
+// issuer)-composition contract, empty-subject short-circuit, and race-freedom.
+
+func TestSlidingWindowLimiter_AllowsUpToCap(t *testing.T) {
+	l := NewSlidingWindowLimiter(3, 24*time.Hour, 10)
+	now := time.Now()
+	for i := 0; i < 3; i++ {
+		if err := l.Allow("k", now.Add(time.Duration(i)*time.Minute)); err != nil {
+			t.Fatalf("call %d should be allowed: %v", i+1, err)
+		}
+	}
+	if err := l.Allow("k", now.Add(4*time.Minute)); !errors.Is(err, ErrRateLimited) {
+		t.Fatalf("4th call should be rate-limited; got %v", err)
+	}
+}
+
+func TestSlidingWindowLimiter_DistinctKeysIndependent(t *testing.T) {
+	l := NewSlidingWindowLimiter(1, 24*time.Hour, 10)
+	now := time.Now()
+
+	if err := l.Allow("k-1", now); err != nil {
+		t.Fatalf("first allow: %v", err)
+	}
+	if err := l.Allow("k-2", now); err != nil {
+		t.Fatalf("different key must have its own bucket: %v", err)
+	}
+	if err := l.Allow("k-1", now.Add(1*time.Second)); !errors.Is(err, ErrRateLimited) {
+		t.Fatalf("repeat key should be limited; got %v", err)
+	}
+}
+
+func TestSlidingWindowLimiter_WindowExpiry(t *testing.T) {
+	l := NewSlidingWindowLimiter(2, 1*time.Hour, 10)
+	now := time.Now()
+
+	if err := l.Allow("k", now); err != nil {
+		t.Fatal(err)
+	}
+	if err := l.Allow("k", now.Add(30*time.Minute)); err != nil {
+		t.Fatal(err)
+	}
+	// Inside window — limited.
+	if err := l.Allow("k", now.Add(45*time.Minute)); !errors.Is(err, ErrRateLimited) {
+		t.Fatalf("inside-window 3rd call should be limited: %v", err)
+	}
+	// Past window — slots reopen.
+	if err := l.Allow("k", now.Add(2*time.Hour)); err != nil {
+		t.Fatalf("past-window call should be allowed (window reset): %v", err)
+	}
+}
+
+func TestSlidingWindowLimiter_DisabledBypass(t *testing.T) {
+	l := NewSlidingWindowLimiter(0, 24*time.Hour, 10) // maxN=0 → disabled
+	if !l.Disabled() {
+		t.Fatal("limiter with maxN=0 must report Disabled()=true")
+	}
+	now := time.Now()
+	for i := 0; i < 100; i++ {
+		if err := l.Allow("k", now); err != nil {
+			t.Fatalf("disabled limiter must allow everything: %v", err)
+		}
+	}
+	if got := l.Len(); got != 0 {
+		t.Errorf("disabled limiter Len() = %d, want 0", got)
+	}
+}
+
+func TestSlidingWindowLimiter_NegativeCapDisabled(t *testing.T) {
+	l := NewSlidingWindowLimiter(-1, 24*time.Hour, 10)
+	if !l.Disabled() {
+		t.Fatal("negative maxN must produce a disabled limiter")
+	}
+}
+
+func TestSlidingWindowLimiter_EmptyKeyShortCircuits(t *testing.T) {
+	// Empty key is the caller's defense-in-depth case — caller's validation
+	// upstream should reject empty-key events first. Limiter must not build
+	// a single shared bucket keyed by empty-key — that would be a chokepoint
+	// for every empty-key event.
+	l := NewSlidingWindowLimiter(1, 24*time.Hour, 10)
+	now := time.Now()
+	for i := 0; i < 50; i++ {
+		if err := l.Allow("", now); err != nil {
+			t.Fatalf("empty key must short-circuit (call %d): %v", i, err)
+		}
+	}
+	if got := l.Len(); got != 0 {
+		t.Errorf("Len after 50 empty-key calls = %d, want 0 (no bucket created)", got)
+	}
+}
+
+func TestSlidingWindowLimiter_DefaultCapsHonored(t *testing.T) {
+	// White-box test: exercises the constructor's default-fill branches.
+	// Lives here (not in the intune wrapper test) because the fields
+	// (window + cap) are package-private to ratelimit.
+	l := NewSlidingWindowLimiter(5, 0, 0) // window=0 → 24h default; cap=0 → 100k default
+	if l.window != 24*time.Hour {
+		t.Errorf("default window = %v, want 24h", l.window)
+	}
+	if l.cap != 100_000 {
+		t.Errorf("default cap = %d, want 100000", l.cap)
+	}
+}
+
+func TestSlidingWindowLimiter_MapCapEvictsOldest(t *testing.T) {
+	// Cap of 3 keys to exercise the eviction branch deterministically.
+	l := NewSlidingWindowLimiter(2, 1*time.Hour, 3)
+	now := time.Now()
+
+	for i := 0; i < 3; i++ {
+		key := fmt.Sprintf("k-%d", i)
+		if err := l.Allow(key, now.Add(time.Duration(i)*time.Minute)); err != nil {
+			t.Fatalf("insert %d: %v", i, err)
+		}
+	}
+	if l.Len() != 3 {
+		t.Fatalf("Len = %d, want 3", l.Len())
+	}
+
+	// 4th key forces eviction of k-0 (its newest timestamp is oldest).
+	if err := l.Allow("k-3", now.Add(10*time.Minute)); err != nil {
+		t.Fatalf("4th-key insert: %v", err)
+	}
+	if l.Len() != 3 {
+		t.Errorf("Len after at-cap insert = %d, want 3 (cap honored)", l.Len())
+	}
+}
+
+func TestSlidingWindowLimiter_ConcurrentRaceFree(t *testing.T) {
+	if testing.Short() {
+		t.Skip("race-style test under -short")
+	}
+	l := NewSlidingWindowLimiter(50, 24*time.Hour, 10000)
+	var wg sync.WaitGroup
+	for g := 0; g < 20; g++ {
+		wg.Add(1)
+		go func(id int) {
+			defer wg.Done()
+			now := time.Now()
+			key := fmt.Sprintf("k-%d", id)
+			for i := 0; i < 30; i++ {
+				_ = l.Allow(key, now)
+			}
+		}(g)
+	}
+	wg.Wait()
+	if got := l.Len(); got != 20 {
+		t.Errorf("expected 20 distinct keys; got %d", got)
+	}
+}
+
+// White-box tests for the unexported pruneOlderThan helper. Live in this
+// package because the helper is package-private to ratelimit. The test
+// surface used to live in intune/rate_limit_test.go before the Phase 4.1
+// extraction.
+func TestPruneOlderThan(t *testing.T) {
+	t0 := time.Now()
+	in := []time.Time{
+		t0.Add(-3 * time.Hour),    // pruned (older than cutoff)
+		t0.Add(-2 * time.Hour),    // pruned (older than cutoff)
+		t0.Add(-1 * time.Hour),    // survives (-60m is NEWER than the -90m cutoff)
+		t0.Add(-30 * time.Minute), // survives
+		t0,                        // survives
+	}
+	out := pruneOlderThan(in, t0.Add(-90*time.Minute))
+	if len(out) != 3 {
+		t.Fatalf("len(out) = %d, want 3 (-1h, -30m, t0 all newer than -90m cutoff)", len(out))
+	}
+	if !out[0].Equal(t0.Add(-1 * time.Hour)) {
+		t.Errorf("out[0] = %v, want -1h (oldest surviving entry)", out[0])
+	}
+}
+
+func TestPruneOlderThan_NoOpWhenNothingToPrune(t *testing.T) {
+	t0 := time.Now()
+	in := []time.Time{t0.Add(-1 * time.Minute), t0}
+	out := pruneOlderThan(in, t0.Add(-1*time.Hour))
+	// Nothing pruned: same length and same backing array (no copy made).
+	if len(out) != len(in) || &out[0] != &in[0] {
+		t.Fatalf("len(out) = %d, want %d with the input's backing array", len(out), len(in))
+	}
+}