auth-bundle-2 Phase 4: session service (cookie minting + signature

validation, idle/absolute expiry, signing-key rotation, CSRF, GC),
15-case negative-test matrix, fail-fatal initial-key bootstrap

Phase 4 of the bundle ships the post-login session lifecycle that backs
every authenticated request once Phase 5 wires the OIDC handlers + the
session middleware. The state machine is the load-bearing primitive for
the Bundle 2 control plane: forge a session cookie and you bypass every
RBAC gate.

Service surface (internal/auth/session/service.go, ~880 LOC):

  - Service.Create(actorID, actorType, ip, ua) -> *CreateResult
    Mints a session row; signs the cookie value with the active signing
    key; returns the cookie payload AND the CSRF token plaintext for
    the handler to set on the response.
  - Service.Validate(ValidateInput) -> *Session
    Parses the cookie, looks up the signing key (incl. retired-but-in-
    retention), recomputes HMAC-SHA256, loads the session row, enforces
    revocation + absolute + idle expiry + optional IP/UA bind. Maps to
    one of 9 sentinel errors; the handler uniformly returns 401 to the
    wire (specific reason in the audit row).
  - Service.ValidateCSRF(headerValue, *Session) error
    Constant-time compares SHA-256(header) against the stored hash on
    the session row.
  - Service.UpdateLastSeen / Revoke / RevokeAllForActor
  - Service.RotateCSRFToken — mints fresh token, persists hash, returns
    plaintext; called on login completion, logout, role-change against
    actor, explicit operator rotate.
  - Service.RotateSigningKey — mints new active key, retires previous;
    retired keys stay valid for cfg.SigningKeyRetention so existing
    cookies don't immediately fail.
  - Service.EnsureInitialSigningKey — idempotent; mints first key on
    fresh deploys; emits auth.session_signing_key_bootstrap audit row
    with event_category=auth. Wired into cmd/server/main.go AFTER
    migrations + RBAC backfill, BEFORE the HTTP listener binds; failure
    is FATAL (logger.Error + os.Exit(1)) per the prompt — server refuses
    to boot rather than serve session-less.
  - Service.GarbageCollect — sweeps expired post-login sessions +
    pre-login rows >10min + retired-past-retention signing keys. Wired
    into the new internal/scheduler/scheduler.go::sessionGCLoop on a
    CERTCTL_SESSION_GC_INTERVAL tick.

Cookie wire format (load-bearing):

  v1.<session_id>.<signing_key_id>.<base64url-no-pad(HMAC-SHA256)>

The HMAC input is LENGTH-PREFIXED to defeat concatenation collisions:

  len(session_id) || ":" || session_id || ":" || len(signing_key_id) || ":" || signing_key_id

where len(...) is the ASCII decimal byte-length. Without the length
prefix, the bare-concatenation form `session_id || signing_key_id`
would let a forger swap one byte across the boundary — `<a, bc>` and
`<ab, c>` produce identical HMAC inputs. The length prefix moves the
boundary into the input itself so the two cases can never collide.

The v1. version prefix is reserved. A future incompatible upgrade
ships as v2. and the parser rejects unknown prefixes (no fallback).

CSRF token model:

  - Plaintext goes in a JS-readable certctl_csrf cookie (HttpOnly=false
    intentional; the GUI must read it to echo into X-CSRF-Token header).
  - SHA-256 hash of the plaintext lives on the session row.
  - Validation: SHA-256(X-CSRF-Token) constant-time-compared.
  - Rotated by Service.RotateCSRFToken on login / logout / role-change /
    explicit admin-trigger.

Optional defense-in-depth (default OFF):

  - CERTCTL_SESSION_BIND_IP — Validate compares client IP to row's
    recorded IP. Mismatch -> 401, audit row, session NOT auto-revoked
    (user may have legitimate IP change). Mobile + corporate-NAT
    environments leave this off.
  - CERTCTL_SESSION_BIND_USER_AGENT — same shape against UA.

Configurable lifetimes (env vars wired in internal/config/config.go):

  CERTCTL_SESSION_IDLE_TIMEOUT             1h
  CERTCTL_SESSION_ABSOLUTE_TIMEOUT         8h
  CERTCTL_SESSION_SIGNING_KEY_RETENTION    24h
  CERTCTL_SESSION_GC_INTERVAL              1h
  CERTCTL_SESSION_SAMESITE                 Lax
  CERTCTL_SESSION_BIND_IP                  false
  CERTCTL_SESSION_BIND_USER_AGENT          false

Test surface (internal/auth/session/service_test.go, ~860 LOC):

  All 15 prompt-mandated negative cases:

    1.  Tampered cookie (HMAC byte flipped near segment start where all
        6 bits are real — base64url-no-pad's last char carries only 2
        bits so a tail-flip is unreliable).
    1b. Tampered SESSION_ID segment (same HMAC-recompute outcome).
    2.  Cookie missing v1. prefix.
    3.  Cookie with unknown version prefix (v99).
    4.  Idle expiry — back-dated last_seen_at + idle_expires_at.
    5.  Absolute expiry — back-dated absolute_expires_at.
    6.  Revoked session.
    7.  Wrong signing key id (no row matches).
    8.  Cookie signed under retired-but-in-retention key SUCCEEDS.
    9.  Cookie signed under retired-past-retention key FAILS.
    10. Concatenation collision — direct evidence that
        computeHMAC("abc","de") != computeHMAC("ab","cde") AND that
        a forged-boundary-slide cookie is rejected.
    11. CSRF token missing.
    12. CSRF token mismatch (constant-time compare).
    13. IP-bind enabled + IP changed -> ErrSessionIPMismatch + audit row.
    14. UA-bind enabled + UA changed -> ErrSessionUAMismatch + audit row.
    15. EnsureInitialSigningKey RNG failure -> ErrInitialSigningKeyMintFailed
        wrap (cmd/server/main.go treats as fatal).

  Plus coverage-lift batch covering: every error wrap on every repo
  collaborator (Create, Get, UpdateLastSeen, UpdateCSRFTokenHash,
  Revoke, RevokeAllForActor, GC), every RNG-failure surface in Create /
  RotateCSRFToken / RotateSigningKey, every alg-pinning helper edge,
  the cookie parser's full negative matrix (empty, wrong segment count,
  missing prefixes, bad base64, wrong HMAC length), and a real-encryption
  round-trip via internal/crypto.EncryptIfKeySet -> DecryptIfKeySet so
  the v3-blob path is exercised end-to-end at the session-cookie level.

Coverage:

  internal/auth/session              94.5%  (floor 90)
  internal/auth/session/domain       96+%   (floor 90, Phase 1)

.github/coverage-thresholds.yml extended with 2 new gate entries
(internal/auth/session and internal/auth/session/domain). The
why: paragraphs explain why each fail-closed branch is load-bearing.

Repository extensions:

  internal/repository/session.go gains UpdateCSRFTokenHash on the
  SessionRepository interface; internal/repository/postgres/session.go
  ships the implementation. RotateCSRFToken consumes it.

Scheduler extensions:

  internal/scheduler/scheduler.go gains SessionGarbageCollector
  interface + sessionGC field + sessionGCInterval +
  SetSessionGarbageCollector + SetSessionGCInterval + sessionGCLoop.
  Pattern matches the existing acmeGCLoop: atomic.Bool guard prevents
  concurrent sweeps, sync.WaitGroup tracks for graceful shutdown,
  per-tick context.WithTimeout(1m) bounds a stuck Postgres.

Server wiring:

  cmd/server/main.go constructs sessionService AFTER the bootstrap
  block (post-RBAC backfill) and BEFORE the policy-service block.
  EnsureInitialSigningKey runs immediately; failure is fatal via
  os.Exit(1). The scheduler section wires SetSessionGarbageCollector
  + SetSessionGCInterval alongside the other interval setters and
  emits an Info log so operators can confirm the loop is enabled.

Phase 4 deviation note: Service.GarbageCollect() returns (int, error)
rather than the prompt's literal `error`. The int is the count of
session rows deleted on this sweep; the scheduler discards it (`_, err
:= ...`) but tests + future operator-facing audit rows can read it.
The wider behavior matches the spec exactly.

Verifications: gofmt clean, go vet ./internal/auth/session/...
./internal/scheduler/... ./internal/config/... ./cmd/server/...
./internal/repository/... clean, go test -short -count=1 -race green
across all 3 session packages, full repository + auth + scheduler +
config test sweeps green, no regressions in Bundle 1 packages.
This commit is contained in:
shankar0123
2026-05-10 05:31:24 +00:00
parent 854135dfb7
commit 17b30c1f7f
8 changed files with 2178 additions and 0 deletions
+72
View File
@@ -84,6 +84,14 @@ type ACMEGarbageCollector interface {
GarbageCollect(ctx context.Context) error
}
// SessionGarbageCollector is the interface the scheduler's sessionGCLoop
// invokes once per CERTCTL_SESSION_GC_INTERVAL tick. Concrete impl is
// *session.Service. Sweeps expired post-login + pre-login session rows
// AND retired-past-retention signing-key rows. Auth Bundle 2 Phase 4.
type SessionGarbageCollector interface {
GarbageCollect(ctx context.Context) (int, error)
}
// JobReaperService defines the interface for job timeout reaping used by the scheduler.
type JobReaperService interface {
ReapTimedOutJobs(ctx context.Context, csrTTL, approvalTTL time.Duration) error
@@ -109,6 +117,7 @@ type Scheduler struct {
cloudDiscoveryService CloudDiscoveryServicer
crlCacheService CRLCacheServicer
acmeGC ACMEGarbageCollector
sessionGC SessionGarbageCollector
jobReaper JobReaperService
logger *slog.Logger
@@ -127,6 +136,7 @@ type Scheduler struct {
crlGenerationInterval time.Duration
jobTimeoutInterval time.Duration
acmeGCInterval time.Duration
sessionGCInterval time.Duration
// agentOfflineJobTTL: per-tick threshold for reaping Running jobs whose
// owning agent has been silent. Bundle C / Audit M-016. Defaults below.
agentOfflineJobTTL time.Duration
@@ -148,6 +158,7 @@ type Scheduler struct {
crlGenerationRunning atomic.Bool
jobTimeoutRunning atomic.Bool
acmeGCRunning atomic.Bool
sessionGCRunning atomic.Bool
// Graceful shutdown: wait for in-flight work to complete
wg sync.WaitGroup
@@ -185,6 +196,7 @@ func NewScheduler(
crlGenerationInterval: 1 * time.Hour,
jobTimeoutInterval: 10 * time.Minute,
acmeGCInterval: 1 * time.Minute,
sessionGCInterval: 1 * time.Hour,
// 5 minutes is 5×agentHealthCheckInterval default of 1m; an agent
// must miss multiple heartbeats before its in-flight jobs are reaped.
agentOfflineJobTTL: 5 * time.Minute,
@@ -317,6 +329,23 @@ func (s *Scheduler) SetACMEGCInterval(d time.Duration) {
s.acmeGCInterval = d
}
// SetSessionGarbageCollector wires the Auth Bundle 2 Phase 4 session GC
// service. Optional; nil disables the loop (Bundle-2-disabled deployments
// still run pre-Phase-4 behavior).
func (s *Scheduler) SetSessionGarbageCollector(gc SessionGarbageCollector) {
s.sessionGC = gc
}
// SetSessionGCInterval configures the interval at which the session GC
// sweep runs. Default 1h. Wire: CERTCTL_SESSION_GC_INTERVAL. Zero or
// negative values are ignored.
func (s *Scheduler) SetSessionGCInterval(d time.Duration) {
if d <= 0 {
return
}
s.sessionGCInterval = d
}
// SetAgentOfflineJobTTL sets the threshold past which a Running job whose
// owning agent has gone silent is reaped to Failed. Bundle C / Audit M-016.
// Zero or negative values are ignored (the default of 5 minutes is kept).
@@ -375,6 +404,9 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
if s.acmeGC != nil {
loopCount++
}
if s.sessionGC != nil {
loopCount++
}
s.wg.Add(loopCount)
go func() { defer s.wg.Done(); s.renewalCheckLoop(ctx) }()
@@ -403,6 +435,9 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
if s.acmeGC != nil {
go func() { defer s.wg.Done(); s.acmeGCLoop(ctx) }()
}
if s.sessionGC != nil {
go func() { defer s.wg.Done(); s.sessionGCLoop(ctx) }()
}
// Signal that all loops are launched
close(startedChan)
@@ -1146,3 +1181,40 @@ func (s *Scheduler) acmeGCLoop(ctx context.Context) {
}
}
}
// sessionGCLoop runs every sessionGCInterval and invokes
// SessionGarbageCollector.GarbageCollect, which sweeps:
// - sessions whose absolute_expires_at is in the past (post-login expired);
// - pre-login session rows older than 10 minutes;
// - retired-past-retention session_signing_keys rows.
//
// Auth Bundle 2 Phase 4. The atomic.Bool guard + the per-tick
// context.WithTimeout match the pattern of every other loop in this
// file: a stuck Postgres can't block the next tick, and concurrent
// sweeps are skipped not queued.
func (s *Scheduler) sessionGCLoop(ctx context.Context) {
ticker := time.NewTicker(s.sessionGCInterval)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return
case <-ticker.C:
if !s.sessionGCRunning.CompareAndSwap(false, true) {
s.logger.Warn("session GC sweep still running, skipping tick")
continue
}
s.wg.Add(1)
go func() {
defer s.wg.Done()
defer s.sessionGCRunning.Store(false)
opCtx, cancel := context.WithTimeout(ctx, time.Minute)
defer cancel()
if _, err := s.sessionGC.GarbageCollect(opCtx); err != nil {
s.logger.Warn("session gc sweep failed (next tick will retry)", "error", err)
}
}()
}
}
}