mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-11 20:58:52 +00:00
2263e2886b
Phase 8 of the SCEP RFC 8894 + Intune master bundle. Wires the
internal/scep/intune validator from Phase 7 into the SCEPService dispatch
path, with a SIGHUP-reloadable trust anchor holder, a per-(Subject, Issuer)
sliding-window rate limiter, and a nil-default ComplianceCheck seam for
V3-Pro.

Operator-visible surface (per-profile, all default to off):

  CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_ENABLED=true
  CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CONNECTOR_CERT_PATH=/etc/certctl/intune.pem
  CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_AUDIENCE=https://certctl.example.com/scep/corp
  CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CHALLENGE_VALIDITY=60m
  CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_PER_DEVICE_RATE_LIMIT_24H=3

Per-profile dispatch (Phase 8.8): an operator running corp-laptops through
Intune AND IoT devices through static challenge configures
INTUNE_ENABLED=true on the corp profile only; the IoT profile's PKCSReq path
skips the dispatcher entirely. Mirrors the per-profile shape established by
Phase 1.5.

Wire-in surfaces:

* config.go (Phase 8.1): SCEPProfileConfig.Intune sub-config of type
  SCEPIntuneProfileConfig (Enabled/ConnectorCertPath/Audience/
  ChallengeValidity/PerDeviceRateLimit24h). Loaded from the indexed
  CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_* env-var family. Per-profile Validate
  gate refuses INTUNE_ENABLED=true with empty ConnectorCertPath OR negative
  PerDeviceRateLimit24h.

* cmd/server/main.go (Phase 8.2 + wire-in): preflightSCEPIntuneTrustAnchor
  helper mirrors preflightSCEPRACertKey/preflightSCEPMTLSTrustBundle shape:
  fail-loud at boot when the trust anchor file is missing / unreadable /
  empty / contains an expired cert. The per-profile loop builds the holder +
  replay cache + rate limiter, calls SetIntuneIntegration on the
  SCEPService, and starts the SIGHUP watcher. A deferred sweep stops every
  watcher at shutdown.

* internal/scep/intune/trust_anchor_holder.go (Phase 8.5): TrustAnchorHolder
  mirrors cmd/server/tls.go::certHolder. RWMutex-guarded pool + Reload that
  swaps a fresh slice on success + WatchSIGHUP goroutine that responds to
  the same SIGHUP the existing TLS-cert watcher uses. A bad reload (parse
  error, expired cert) keeps the OLD pool in place so a half-rotation
  doesn't take Intune enrollment down (same fail-safe pattern). Operators
  rotate via the on-disk file then 'kill -HUP <certctl-pid>'.

* internal/scep/intune/rate_limit.go (Phase 8.6): hand-rolled
  sliding-window-log limiter keyed by (Subject, Issuer). 100k-entry map cap
  (matches replay cache); at-cap drops the bucket whose newest timestamp is
  the oldest. Default 3 enrollments per 24h covers legitimate first-cert +
  recovery + post-wipe re-enrollment but blocks bulk enumeration from a
  compromised Connector signing key. maxN <= 0 disables the limiter for
  tests + the rare operator who wants no per-device cap. Empty subject
  short-circuits to allow (defense-in-depth: caller's claim validation
  rejects empty-subject upstream; no shared bucket on ''). Why hand-rolled
  instead of golang.org/x/time/rate: the rate package is in go.sum as an
  indirect transitive but not a direct dep. ~30 LoC of stdlib avoids
  creating a new direct dep.

* internal/service/scep.go (Phase 8.3 + 8.4 + 8.7):
  - SCEPService gains intuneEnabled / intuneTrust / intuneAudience /
    intuneValidity / intuneReplayCache / intuneRateLimiter / complianceCheck
    fields.
  - SetIntuneIntegration() constructor-time injection wires the per-profile
    state. Profiles with INTUNE_ENABLED=false never call this method, so
    they pay zero overhead.
  - SetComplianceCheck() installs the V3-Pro plug-in (see Phase 8.7).
  - looksIntuneShaped(): JWT-shape pre-check (length > 200 + exactly two
    dots). Allowed to false-positive (validator catches malformed →
    ErrChallengeMalformed); MUST NOT false-negative on real Intune
    challenges.
  - dispatchIntuneChallenge(): the load-bearing core. Runs ValidateChallenge
    → CSR-binding via DeviceMatchesCSR → replay cache CheckAndInsert →
    per-device Allow → optional ComplianceCheck. Each failure leg increments
    a typed metric label and emits an audit-friendly Warn log line.
  - PKCSReq + PKCSReqWithEnvelope + RenewalReqWithEnvelope all call
    dispatchIntuneChallenge first; on outcome.decided=true they either
    short-circuit (with a typed-error → SCEPFailInfo mapping) or call
    processEnrollment with action='scep_pkcsreq_intune' (so audit greps can
    count Intune-vs-static enrollments).
  - mapIntuneErrorToFailInfo(): typed-error → SCEPFailInfo per RFC 8894
    §3.2.1.4.5 (signature/replay/expired → BadMessageCheck; claim-mismatch →
    BadRequest; default → BadRequest).
  - intuneFailReason(): typed-error → metric label ('signature_invalid' /
    'expired' / 'rate_limited' / etc.). Default 'malformed' so a
    previously-unseen error category still surfaces in the metric for
    follow-up.
  - ComplianceCheck (Phase 8.7): nil-default no-op gate. V3-Pro plugs in via
    SetComplianceCheck to call Microsoft Graph's compliance API. Returns
    (compliant, reason, err). nil-err + compliant=false → CertRep FAILURE +
    'compliance' reason in audit. err != nil → fail-safe deny (V3-Pro module
    is responsible for any 'permit on API failure' policy).

* internal/service/scep.go also gains parseCSRForIntune(): small private
  wrapper around encoding/pem + x509 used by the dispatcher for the claim ↔
  CSR binding check (separated from the broader processEnrollment because we
  want to bind BEFORE consuming the replay-cache slot).

Tests (gates: ≥85% coverage on intune package, ≥70% on service):

* scep_intune_test.go (in internal/service): 14 dispatcher tests covering
  happy-path Intune enrollment + static-challenge fallback +
  tampered-challenge reject + claim-mismatch reject + replay detected +
  rate-limited + compliance-hook nil-default + compliance-hook denies
  non-compliant + compliance-hook error fails closed + IntuneEnabled
  accessor + 'no IntuneEnabled = static path unchanged' regression pin +
  intuneFailReason mapping for every typed error + looksIntuneShaped
  boundary cases.

* trust_anchor_holder_test.go (in internal/scep/intune): NewLoadsBundle,
  NewRequiresLogger, NewSurfacesLoadError, ReloadHappyPath,
  ReloadKeepsOldOnFailure, ReloadKeepsOldOnExpired (the fail-safe semantics
  that make the SIGHUP path operator-friendly), WatchSIGHUPReloadsPool (real
  SIGHUP to self with poll-for-swap pattern mirroring
  cmd/server/tls_test.go), WatchSIGHUPStopIsClean (does NOT fire SIGHUP
  after stop; same caveat as the TLS test: the Go runtime would otherwise
  terminate the test runner on the next SIGHUP since signal.Stop has removed
  the handler).

* rate_limit_test.go (in internal/scep/intune): AllowsUpToCap,
  DistinctKeysIndependent, WindowExpiry, DisabledBypass (maxN=0),
  NegativeCapDisabled, EmptySubjectShortCircuits (defense-in-depth against
  an empty-subject DoS chokepoint), DefaultCapsHonored, MapCapEvictsOldest
  (at-cap eviction branch), ConcurrentRaceFree (50 goroutines × 200
  inserts), pruneOlderThan + the no-op case.

Verification:

* gofmt -l on all touched files: clean
* go vet ./... : clean
* staticcheck on intune/service/config/cmd-server: clean
* go test -count=1 -cover ./internal/scep/intune/...: 94.8% (target ≥85%)
* go test -short across intune+service+config+handler+cmd-server: all green
* G-3 docs-drift CI guard reproduced locally: docs-only filtered=empty,
  config-only=empty. The new env vars match the existing CERTCTL_SCEP_
  allowlist prefix.

Refs:
  cowork/scep-rfc8894-intune-master-prompt.md::Phase 8
  cowork/scep-rfc8894-intune/progress.md

Constitutional rule: 'Always take the complete path, not the easy path'
(cowork/CLAUDE.md::Operating Rules): operator can flip
CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_ENABLED=true and observe the dispatcher
pick up Intune-shaped challenges end-to-end with no further code changes.
Foundation + plumbing ship together.
194 lines
7.0 KiB
Go
package intune

import (
	"errors"
	"sync"
	"time"
)

// SCEP RFC 8894 + Intune master bundle Phase 8.6.
//
// PerDeviceRateLimiter is the second line of defense behind the replay cache
// from Phase 7. The replay cache catches the same challenge being submitted
// twice (within the challenge TTL); this rate limiter catches a compromised
// Connector signing key (or a stolen key+cert pair) issuing many DIFFERENT
// valid challenges for the same device subject in a short window.
//
// Threat model:
//
//   - Replay cache (Phase 7): nonce-keyed; catches duplicate submission.
//   - This limiter: (Subject, Issuer)-keyed; catches enrollment-flooding.
//
// Default: 3 enrollments per (device GUID, Connector identity) per 24h.
//
// Sizing: 100,000 distinct device entries (matches the replay cache cap).
// At-cap: the bucket whose newest timestamp is the oldest is evicted (one
// linear scan over the map) to avoid unbounded memory growth on a fleet
// that grows past the cap.
//
// Why a hand-rolled limiter instead of pulling in golang.org/x/time/rate:
// the rate package is in go.sum as an indirect transitive but NOT a direct
// dep. Adding it would create a new direct dep relationship for ~30 LoC of
// state machine. The hand-rolled version below uses only stdlib (sync.Mutex
// + time.Time arithmetic) and is small enough to fit on one screen.
//
// Algorithm: each (Subject, Issuer) key maps to a bucket holding a window's
// worth of recent enrollment timestamps. On Allow, the bucket prunes
// timestamps older than (now - window) and either appends the current
// timestamp + returns nil, or rejects + returns ErrRateLimited when the
// post-prune count is already at the cap. This is the "sliding window log"
// rate limiter: exact (no token-leak rounding); O(N_per_key) per call, but
// N is bounded by the cap (3 by default), so effectively O(1).

// ErrRateLimited is the typed error returned when the per-device rate limit
// fires. The handler maps this to a CertRep FAILURE with badRequest failInfo
// + the `rate_limited` metric label.
var ErrRateLimited = errors.New("intune: per-device rate limit exceeded for this (subject, issuer) within the configured window")

// PerDeviceRateLimiter is a sliding-window-log rate limiter keyed by
// (Subject, Issuer) tuples derived from a parsed challenge claim.
//
// Concurrency: the limiter is safe for concurrent Allow calls. The internal
// map is guarded by a mutex; the per-key slices are mutated only while the
// mutex is held.
type PerDeviceRateLimiter struct {
	mu       sync.Mutex
	buckets  map[string][]time.Time // key → sliding window of timestamps
	maxN     int                    // max enrollments per window
	window   time.Duration          // window length (default 24h)
	cap      int                    // max keys before at-cap eviction kicks in
	disabled bool                   // maxN ≤ 0 → all Allow calls return nil
}

// NewPerDeviceRateLimiter returns a limiter with the given per-key cap +
// window. maxN ≤ 0 disables the limiter (all Allow calls return nil); this
// is operator opt-out for the rare case where the per-device cap is
// undesirable (e.g. test harnesses, sketchpad deploys).
//
// Window defaults to 24h when zero or negative. Map cap defaults to 100,000
// when zero or negative (matches the replay cache cap; see
// internal/scep/intune/replay.go).
func NewPerDeviceRateLimiter(maxN int, window time.Duration, mapCap int) *PerDeviceRateLimiter {
	if window <= 0 {
		window = 24 * time.Hour
	}
	if mapCap <= 0 {
		mapCap = 100_000
	}
	return &PerDeviceRateLimiter{
		buckets:  make(map[string][]time.Time),
		maxN:     maxN,
		window:   window,
		cap:      mapCap,
		disabled: maxN <= 0,
	}
}

// Allow checks whether an enrollment for the given (subject, issuer) tuple
// is permitted right now. Returns nil when allowed (and records the timestamp
// in the bucket) or ErrRateLimited when the bucket is at maxN.
//
// Empty subject is treated as "skip the limiter": the caller's claim
// validation should have rejected an empty-subject claim already; this is
// belt-and-suspenders to prevent a single empty-subject bucket from
// becoming a fleet-wide chokepoint. The Connector emits non-empty subject
// (device GUID) on every legitimate challenge.
func (l *PerDeviceRateLimiter) Allow(subject, issuer string, now time.Time) error {
	if l.disabled {
		return nil
	}
	if subject == "" {
		// Caller's claim validation should reject empty-subject upstream;
		// this short-circuit is defense-in-depth so a misconfigured
		// Connector can't DoS us via the rate-limit path.
		return nil
	}
	key := subject + "|" + issuer

	l.mu.Lock()
	defer l.mu.Unlock()

	// At-cap eviction: when the map is full and this key is new, drop the
	// bucket whose newest timestamp is the oldest. An already-tracked key
	// never triggers eviction: it does not grow the map, and evicting on
	// its behalf could discard the caller's own history and reset its
	// limit. O(N_keys), but fires only once the map is full.
	if _, tracked := l.buckets[key]; !tracked && len(l.buckets) >= l.cap {
		l.evictOldestLocked(now)
	}

	bucket := l.buckets[key]
	bucket = pruneOlderThan(bucket, now.Add(-l.window))

	if len(bucket) >= l.maxN {
		// Don't append; over the limit. Persist the pruned bucket so the
		// next call sees the most-recently-pruned state.
		l.buckets[key] = bucket
		return ErrRateLimited
	}

	bucket = append(bucket, now)
	l.buckets[key] = bucket
	return nil
}

// pruneOlderThan returns the slice with all entries strictly before
// `cutoff` removed. Preserves order (timestamps are appended in increasing
// time, so a single linear scan from the front suffices).
func pruneOlderThan(b []time.Time, cutoff time.Time) []time.Time {
	i := 0
	for i < len(b) && b[i].Before(cutoff) {
		i++
	}
	if i == 0 {
		return b
	}
	// Copy-shrink to release the underlying-array memory eventually
	// (otherwise the slice would hold a reference to the older entries
	// indefinitely until a re-allocation).
	out := make([]time.Time, len(b)-i)
	copy(out, b[i:])
	return out
}

// evictOldestLocked drops the map entry whose newest timestamp is the
// oldest. Called under l.mu. O(N_keys) per eviction; at-cap is rare in
// practice (caps are sized for fleet steady-state).
func (l *PerDeviceRateLimiter) evictOldestLocked(now time.Time) {
	var (
		oldestKey string
		oldestTs  time.Time
		first     = true
	)
	for k, b := range l.buckets {
		if len(b) == 0 {
			// Empty bucket: drop it immediately, no candidate scan needed.
			delete(l.buckets, k)
			return
		}
		newest := b[len(b)-1]
		if first || newest.Before(oldestTs) {
			oldestKey = k
			oldestTs = newest
			first = false
		}
	}
	if oldestKey != "" {
		delete(l.buckets, oldestKey)
	}
	// Suppress unused-parameter warning for `now` in case the eviction
	// strategy changes (e.g. swap to LRU keyed by time of last Allow).
	_ = now
}

// Len returns the approximate number of distinct (subject, issuer) keys
// currently tracked. For observability + tests; not load-stable under
// concurrent Allow calls.
func (l *PerDeviceRateLimiter) Len() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return len(l.buckets)
}

// Disabled reports whether the limiter is in opt-out mode (maxN ≤ 0).
// Useful for handler-side gating + admin-endpoint observability.
func (l *PerDeviceRateLimiter) Disabled() bool {
	return l.disabled
}