mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-12 17:59:24 +00:00
auth-bundle-2 Phase 4: session service (cookie minting + signature
validation, idle/absolute expiry, signing-key rotation, CSRF, GC),
15-case negative-test matrix, fail-fatal initial-key bootstrap
Phase 4 of the bundle ships the post-login session lifecycle that backs
every authenticated request once Phase 5 wires the OIDC handlers + the
session middleware. The state machine is the load-bearing primitive for
the Bundle 2 control plane: forge a session cookie and you bypass every
RBAC gate.
Service surface (internal/auth/session/service.go, ~880 LOC):
- Service.Create(actorID, actorType, ip, ua) -> *CreateResult
Mints a session row; signs the cookie value with the active signing
key; returns the cookie payload AND the CSRF token plaintext for
the handler to set on the response.
- Service.Validate(ValidateInput) -> *Session
Parses the cookie, looks up the signing key (incl. retired-but-in-
retention), recomputes HMAC-SHA256, loads the session row, enforces
revocation + absolute + idle expiry + optional IP/UA bind. Maps to
one of 9 sentinel errors; the handler uniformly returns 401 to the
wire (specific reason in the audit row).
- Service.ValidateCSRF(headerValue, *Session) error
Constant-time compares SHA-256(header) against the stored hash on
the session row.
- Service.UpdateLastSeen / Revoke / RevokeAllForActor
- Service.RotateCSRFToken — mints fresh token, persists hash, returns
plaintext; called on login completion, logout, role-change against
actor, explicit operator rotate.
- Service.RotateSigningKey — mints new active key, retires previous;
retired keys stay valid for cfg.SigningKeyRetention so existing
cookies don't immediately fail.
- Service.EnsureInitialSigningKey — idempotent; mints first key on
fresh deploys; emits auth.session_signing_key_bootstrap audit row
with event_category=auth. Wired into cmd/server/main.go AFTER
migrations + RBAC backfill, BEFORE the HTTP listener binds; failure
is FATAL (logger.Error + os.Exit(1)) per the prompt — server refuses
to boot rather than serve session-less.
- Service.GarbageCollect — sweeps expired post-login sessions +
pre-login rows >10min + retired-past-retention signing keys. Wired
into the new internal/scheduler/scheduler.go::sessionGCLoop on a
CERTCTL_SESSION_GC_INTERVAL tick.
Cookie wire format (load-bearing):
v1.<session_id>.<signing_key_id>.<base64url-no-pad(HMAC-SHA256)>
The HMAC input is LENGTH-PREFIXED to defeat concatenation collisions:
len(session_id) || ":" || session_id || ":" || len(signing_key_id) || ":" || signing_key_id
where len(...) is the ASCII decimal byte-length. Without the length
prefix, the bare-concatenation form `session_id || signing_key_id`
would let a forger swap one byte across the boundary — `<a, bc>` and
`<ab, c>` produce identical HMAC inputs. The length prefix moves the
boundary into the input itself so the two cases can never collide.
The v1. version prefix is reserved. A future incompatible upgrade
ships as v2. and the parser rejects unknown prefixes (no fallback).
CSRF token model:
- Plaintext goes in a JS-readable certctl_csrf cookie (HttpOnly=false
intentional; the GUI must read it to echo into X-CSRF-Token header).
- SHA-256 hash of the plaintext lives on the session row.
- Validation: SHA-256(X-CSRF-Token) constant-time-compared.
- Rotated by Service.RotateCSRFToken on login / logout / role-change /
explicit admin-trigger.
Optional defense-in-depth (default OFF):
- CERTCTL_SESSION_BIND_IP — Validate compares client IP to row's
recorded IP. Mismatch -> 401, audit row, session NOT auto-revoked
(user may have legitimate IP change). Mobile + corporate-NAT
environments leave this off.
- CERTCTL_SESSION_BIND_USER_AGENT — same shape against UA.
Configurable lifetimes (env vars wired in internal/config/config.go):
CERTCTL_SESSION_IDLE_TIMEOUT 1h
CERTCTL_SESSION_ABSOLUTE_TIMEOUT 8h
CERTCTL_SESSION_SIGNING_KEY_RETENTION 24h
CERTCTL_SESSION_GC_INTERVAL 1h
CERTCTL_SESSION_SAMESITE Lax
CERTCTL_SESSION_BIND_IP false
CERTCTL_SESSION_BIND_USER_AGENT false
Test surface (internal/auth/session/service_test.go, ~860 LOC):
All 15 prompt-mandated negative cases:
1. Tampered cookie (HMAC byte flipped near segment start where all
6 bits are real — base64url-no-pad's last char carries only 2
bits so a tail-flip is unreliable).
1b. Tampered SESSION_ID segment (same HMAC-recompute outcome).
2. Cookie missing v1. prefix.
3. Cookie with unknown version prefix (v99).
4. Idle expiry — back-dated last_seen_at + idle_expires_at.
5. Absolute expiry — back-dated absolute_expires_at.
6. Revoked session.
7. Wrong signing key id (no row matches).
8. Cookie signed under retired-but-in-retention key SUCCEEDS.
9. Cookie signed under retired-past-retention key FAILS.
10. Concatenation collision — direct evidence that
computeHMAC("abc","de") != computeHMAC("ab","cde") AND that
a forged-boundary-slide cookie is rejected.
11. CSRF token missing.
12. CSRF token mismatch (constant-time compare).
13. IP-bind enabled + IP changed -> ErrSessionIPMismatch + audit row.
14. UA-bind enabled + UA changed -> ErrSessionUAMismatch + audit row.
15. EnsureInitialSigningKey RNG failure -> ErrInitialSigningKeyMintFailed
wrap (cmd/server/main.go treats as fatal).
Plus coverage-lift batch covering: every error wrap on every repo
collaborator (Create, Get, UpdateLastSeen, UpdateCSRFTokenHash,
Revoke, RevokeAllForActor, GC), every RNG-failure surface in Create /
RotateCSRFToken / RotateSigningKey, every alg-pinning helper edge,
the cookie parser's full negative matrix (empty, wrong segment count,
missing prefixes, bad base64, wrong HMAC length), and a real-encryption
round-trip via internal/crypto.EncryptIfKeySet -> DecryptIfKeySet so
the v3-blob path is exercised end-to-end at the session-cookie level.
Coverage:
internal/auth/session 94.5% (floor 90)
internal/auth/session/domain 96+% (floor 90, Phase 1)
.github/coverage-thresholds.yml extended with 2 new gate entries
(internal/auth/session and internal/auth/session/domain). The
why: paragraphs explain why each fail-closed branch is load-bearing.
Repository extensions:
internal/repository/session.go gains UpdateCSRFTokenHash on the
SessionRepository interface; internal/repository/postgres/session.go
ships the implementation. RotateCSRFToken consumes it.
Scheduler extensions:
internal/scheduler/scheduler.go gains SessionGarbageCollector
interface + sessionGC field + sessionGCInterval +
SetSessionGarbageCollector + SetSessionGCInterval + sessionGCLoop.
Pattern matches the existing acmeGCLoop: atomic.Bool guard prevents
concurrent sweeps, sync.WaitGroup tracks for graceful shutdown,
per-tick context.WithTimeout(1m) bounds a stuck Postgres.
Server wiring:
cmd/server/main.go constructs sessionService AFTER the bootstrap
block (post-RBAC backfill) and BEFORE the policy-service block.
EnsureInitialSigningKey runs immediately; failure is fatal via
os.Exit(1). The scheduler section wires SetSessionGarbageCollector
+ SetSessionGCInterval alongside the other interval setters and
emits an Info log so operators can confirm the loop is enabled.
Phase 4 deviation note: Service.GarbageCollect() returns (int, error)
rather than the prompt's literal `error`. The int is the count of
session rows deleted on this sweep; the scheduler discards it (`_, err
:= ...`) but tests + future operator-facing audit rows can read it.
The wider behavior matches the spec exactly.
Verifications: gofmt clean, go vet ./internal/auth/session/...
./internal/scheduler/... ./internal/config/... ./cmd/server/...
./internal/repository/... clean, go test -short -count=1 -race green
across all 3 session packages, full repository + auth + scheduler +
config test sweeps green, no regressions in Bundle 1 packages.
This commit is contained in:
@@ -84,6 +84,14 @@ type ACMEGarbageCollector interface {
|
||||
GarbageCollect(ctx context.Context) error
|
||||
}
|
||||
|
||||
// SessionGarbageCollector is the interface the scheduler's sessionGCLoop
|
||||
// invokes once per CERTCTL_SESSION_GC_INTERVAL tick. Concrete impl is
|
||||
// *session.Service. Sweeps expired post-login + pre-login session rows
|
||||
// AND retired-past-retention signing-key rows. Auth Bundle 2 Phase 4.
|
||||
type SessionGarbageCollector interface {
|
||||
GarbageCollect(ctx context.Context) (int, error)
|
||||
}
|
||||
|
||||
// JobReaperService defines the interface for job timeout reaping used by the scheduler.
|
||||
type JobReaperService interface {
|
||||
ReapTimedOutJobs(ctx context.Context, csrTTL, approvalTTL time.Duration) error
|
||||
@@ -109,6 +117,7 @@ type Scheduler struct {
|
||||
cloudDiscoveryService CloudDiscoveryServicer
|
||||
crlCacheService CRLCacheServicer
|
||||
acmeGC ACMEGarbageCollector
|
||||
sessionGC SessionGarbageCollector
|
||||
jobReaper JobReaperService
|
||||
logger *slog.Logger
|
||||
|
||||
@@ -127,6 +136,7 @@ type Scheduler struct {
|
||||
crlGenerationInterval time.Duration
|
||||
jobTimeoutInterval time.Duration
|
||||
acmeGCInterval time.Duration
|
||||
sessionGCInterval time.Duration
|
||||
// agentOfflineJobTTL: per-tick threshold for reaping Running jobs whose
|
||||
// owning agent has been silent. Bundle C / Audit M-016. Defaults below.
|
||||
agentOfflineJobTTL time.Duration
|
||||
@@ -148,6 +158,7 @@ type Scheduler struct {
|
||||
crlGenerationRunning atomic.Bool
|
||||
jobTimeoutRunning atomic.Bool
|
||||
acmeGCRunning atomic.Bool
|
||||
sessionGCRunning atomic.Bool
|
||||
|
||||
// Graceful shutdown: wait for in-flight work to complete
|
||||
wg sync.WaitGroup
|
||||
@@ -185,6 +196,7 @@ func NewScheduler(
|
||||
crlGenerationInterval: 1 * time.Hour,
|
||||
jobTimeoutInterval: 10 * time.Minute,
|
||||
acmeGCInterval: 1 * time.Minute,
|
||||
sessionGCInterval: 1 * time.Hour,
|
||||
// 5 minutes is 5×agentHealthCheckInterval default of 1m; an agent
|
||||
// must miss multiple heartbeats before its in-flight jobs are reaped.
|
||||
agentOfflineJobTTL: 5 * time.Minute,
|
||||
@@ -317,6 +329,23 @@ func (s *Scheduler) SetACMEGCInterval(d time.Duration) {
|
||||
s.acmeGCInterval = d
|
||||
}
|
||||
|
||||
// SetSessionGarbageCollector wires the Auth Bundle 2 Phase 4 session GC
|
||||
// service. Optional; nil disables the loop (Bundle-2-disabled deployments
|
||||
// still run pre-Phase-4 behavior).
|
||||
func (s *Scheduler) SetSessionGarbageCollector(gc SessionGarbageCollector) {
|
||||
s.sessionGC = gc
|
||||
}
|
||||
|
||||
// SetSessionGCInterval configures the interval at which the session GC
|
||||
// sweep runs. Default 1h. Wire: CERTCTL_SESSION_GC_INTERVAL. Zero or
|
||||
// negative values are ignored.
|
||||
func (s *Scheduler) SetSessionGCInterval(d time.Duration) {
|
||||
if d <= 0 {
|
||||
return
|
||||
}
|
||||
s.sessionGCInterval = d
|
||||
}
|
||||
|
||||
// SetAgentOfflineJobTTL sets the threshold past which a Running job whose
|
||||
// owning agent has gone silent is reaped to Failed. Bundle C / Audit M-016.
|
||||
// Zero or negative values are ignored (the default of 5 minutes is kept).
|
||||
@@ -375,6 +404,9 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
|
||||
if s.acmeGC != nil {
|
||||
loopCount++
|
||||
}
|
||||
if s.sessionGC != nil {
|
||||
loopCount++
|
||||
}
|
||||
s.wg.Add(loopCount)
|
||||
|
||||
go func() { defer s.wg.Done(); s.renewalCheckLoop(ctx) }()
|
||||
@@ -403,6 +435,9 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
|
||||
if s.acmeGC != nil {
|
||||
go func() { defer s.wg.Done(); s.acmeGCLoop(ctx) }()
|
||||
}
|
||||
if s.sessionGC != nil {
|
||||
go func() { defer s.wg.Done(); s.sessionGCLoop(ctx) }()
|
||||
}
|
||||
|
||||
// Signal that all loops are launched
|
||||
close(startedChan)
|
||||
@@ -1146,3 +1181,40 @@ func (s *Scheduler) acmeGCLoop(ctx context.Context) {
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// sessionGCLoop runs every sessionGCInterval and invokes
|
||||
// SessionGarbageCollector.GarbageCollect, which sweeps:
|
||||
// - sessions whose absolute_expires_at is in the past (post-login expired);
|
||||
// - pre-login session rows older than 10 minutes;
|
||||
// - retired-past-retention session_signing_keys rows.
|
||||
//
|
||||
// Auth Bundle 2 Phase 4. The atomic.Bool guard + the per-tick
|
||||
// context.WithTimeout match the pattern of every other loop in this
|
||||
// file: a stuck Postgres can't block the next tick, and concurrent
|
||||
// sweeps are skipped not queued.
|
||||
func (s *Scheduler) sessionGCLoop(ctx context.Context) {
|
||||
ticker := time.NewTicker(s.sessionGCInterval)
|
||||
defer ticker.Stop()
|
||||
|
||||
for {
|
||||
select {
|
||||
case <-ctx.Done():
|
||||
return
|
||||
case <-ticker.C:
|
||||
if !s.sessionGCRunning.CompareAndSwap(false, true) {
|
||||
s.logger.Warn("session GC sweep still running, skipping tick")
|
||||
continue
|
||||
}
|
||||
s.wg.Add(1)
|
||||
go func() {
|
||||
defer s.wg.Done()
|
||||
defer s.sessionGCRunning.Store(false)
|
||||
opCtx, cancel := context.WithTimeout(ctx, time.Minute)
|
||||
defer cancel()
|
||||
if _, err := s.sessionGC.GarbageCollect(opCtx); err != nil {
|
||||
s.logger.Warn("session gc sweep failed (next tick will retry)", "error", err)
|
||||
}
|
||||
}()
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
Reference in New Issue
Block a user