Files
shankar0123 a41fc2d75c feat(ratelimit): Phase 13 Sprint 13.3 — wire backend selector + scheduler janitor + docs + helm (ARCH-M1 closure complete)
Phase 13 Sprint 13.3 — the completion half of the ARCH-M1
substantive close. Sprint 13.2 shipped the Postgres-backed
sliding-window limiter + multi-replica integration test; Sprint 13.3
wires the 6 call sites in cmd/server/main.go through the operator-
chosen backend selector, adds the rate_limit_buckets scheduler
janitor sweep, rewrites the observability doc, exposes the env-var
in the helm chart, and promotes the multi-replica integration test
to a required CI status check.

Signature ground-truth (sprint 13.2 + 13.3)
===========================================
Prompt-template signatures: `Allow(key string) error` and "5 call
sites." Actual repo: `Allow(key string, now time.Time) error` and 6
NewSlidingWindowLimiter call sites in cmd/server/main.go (the prompt
miscounted the second EST per-principal arm). Per CLAUDE.md "the repo
is truth," matched the live shape.

What changed
============

internal/config/server.go (+40 LOC):
  - Added `SlidingWindowBackend string` + `SlidingWindowJanitorInterval
    time.Duration` to RateLimitConfig with full operator-facing
    documentation of the two valid values (memory|postgres) +
    when-to-use-which decision tree.

internal/config/config.go (+27 LOC):
  - Load() reads CERTCTL_RATE_LIMIT_BACKEND (default "memory") +
    CERTCTL_RATE_LIMIT_JANITOR_INTERVAL (default 5m).
  - Validate() rejects anything other than ""/"memory"/"postgres"
    (empty = memory equivalence for test-built Configs that bypass
    Load()). Janitor interval must be ≥ 1 minute when set.
  - Failure modes return clear ::error:: with the env-var name + the
    valid values, so an operator typo ("postgress" → memory in a
    3-replica cluster) fails fast at startup.

internal/ratelimit/factory.go (NEW, 67 LOC):
  - NewLimiter(backend, db, maxN, window, mapCap) Limiter — single
    factory the 6 cmd/server/main.go call sites route through.
  - Drop-in signature: same maxN/window/mapCap as
    NewSlidingWindowLimiter (mapCap accepted + ignored for postgres
    — the rate_limit_buckets table grows until the janitor sweeps).
  - Defensive panic on unknown backend (config.Validate is SoT;
    this is belt-and-suspenders).

internal/ratelimit/postgres_gc.go (NEW, 73 LOC):
  - PostgresGC struct + NewPostgresGC + GarbageCollect.
  - Single-statement DELETE FROM rate_limit_buckets WHERE
    updated_at < NOW() - maxWindow. Idempotent.
  - maxWindow <= 0 is a no-op (operator opt-out).

internal/scheduler/scheduler.go (+90 LOC):
  - New RateLimitGarbageCollector interface (mirrors the
    ACMEGarbageCollector / SessionGarbageCollector contracts).
  - rateLimitGC field + rateLimitGCInterval + rateLimitGCRunning
    on Scheduler.
  - SetRateLimitGarbageCollector(gc) + SetRateLimitGCInterval(d)
    Setters following the existing acmeGC/sessionGC pattern.
  - rateLimitGCLoop() — JitteredTicker + atomic.Bool guard +
    per-tick context.WithTimeout(1m). Logs row count at Debug.
  - Loop counted in the Start() WaitGroup only when the GC is
    non-nil; cmd/server/main.go skips SetRateLimitGarbageCollector
    when backend=memory so the loop never launches for that case.

cmd/server/main.go (35 LOC diff):
  - All 6 ratelimit.NewSlidingWindowLimiter call sites now route
    through ratelimit.NewLimiter(cfg.RateLimit.SlidingWindowBackend,
    db, ...). Grep verification post-fix returns ZERO hits.
  - Six sites: breakglass loginLimiter (580), ocspLimiter (1003),
    exportLimiter (1068), EST failed-basic (1535), EST per-principal
    SCEP-mTLS arm (1591), EST per-principal SCEP arm (1613). The
    intune.NewPerDeviceRateLimiter site at line 1823 stays unmoved
    — its inner type-alias wrapper is the prompt's
    out-of-scope (cmd/server/*.go only).
  - Conditionally constructs PostgresGC + wires the scheduler janitor
    when backend=postgres; logs the wiring decision either way so
    operators see "rate-limit GC sweep enabled (postgres backend)"
    or "in-memory backend self-prunes" in the boot log.

internal/api/handler/{est,export,certificates,auth_breakglass}.go:
  - Replaced 5 *ratelimit.SlidingWindowLimiter field/Setter types
    with ratelimit.Limiter (the interface). Allow() satisfies the
    same call shape on both backends; the in-memory tests that
    construct *SlidingWindowLimiter still compile because the
    concrete type satisfies the interface (compile-time check in
    internal/ratelimit/limiter.go pins this).

docs/operator/observability.md (176 LOC diff):
  - Replaced the "per-process, in-memory, reset-on-restart, not
    shared across replicas" paragraph with the new
    configurable-backend section: operator decision tree,
    backend internals (memory vs postgres), janitor description,
    falsifiable closure proof (the Sprint 13.2 integration test
    name + invocation), helm chart wiring example.
  - Updated inventory to reflect the actual handler file paths +
    actual cap configurations (the prior doc said "60s window" for
    several limiters that actually use 60m / 24h windows).
  - Doc smoke confirmed: grep -c 'per-process, in-memory,
    reset-on-restart' docs/operator/observability.md = 0.

deploy/helm/certctl/values.yaml + templates/server-configmap.yaml +
templates/server-deployment.yaml:
  - Exposed server.rateLimiting.backend (default "memory") +
    server.rateLimiting.janitorInterval (default "5m") under the
    existing rateLimiting block.
  - ConfigMap renders both as rate-limit-backend +
    rate-limit-janitor-interval keys.
  - Deployment wires CERTCTL_RATE_LIMIT_BACKEND +
    CERTCTL_RATE_LIMIT_JANITOR_INTERVAL env vars from the configmap.
  - Helm render: `helm template deploy/helm/certctl --set
    server.rateLimiting.backend=postgres` shows the env-var on the
    server-deployment.yaml output.

.github/workflows/ci.yml (+12 LOC):
  - Added a new step in the Go Build & Test job that runs the
    Sprint 13.2 multi-replica integration test
    (TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas) with
    -tags=integration -race -timeout=300s. Fails the CI status check
    if the cross-replica row lock ever stops arbitrating across
    replicas — the ARCH-M1 closure regression gate.

Verification (all green locally; postgres integration via CI)
============================================================

  $ grep -nE 'NewSlidingWindowLimiter' cmd/server/*.go
    (zero hits — Sprint 13.3 receipt)

  $ go test -short -count=1 \
      ./internal/config/... ./internal/ratelimit/... \
      ./internal/scheduler/... ./internal/api/handler/... \
      ./cmd/server/...
    ok  internal/config       1.177s
    ok  internal/ratelimit    0.007s
    ok  internal/scheduler    9.165s
    ok  internal/api/handler  6.245s
    ok  cmd/server            0.390s

  $ staticcheck ./internal/ratelimit/... ./internal/scheduler/... \
      ./internal/config/... ./internal/api/handler/... ./cmd/server/...
    (clean)

  $ gofmt -l internal/ cmd/server/
    (clean)

  $ grep -c 'per-process, in-memory, reset-on-restart' \
      docs/operator/observability.md
    0   (doc smoke — the audit's verbatim phrasing is gone)

  $ bash scripts/ci-guards/G-3-env-docs-drift.sh
    G-3 env-docs-drift: clean.

  $ bash scripts/ci-guards/complete-path-config-coverage.sh
    OK — every CERTCTL_* env var (197) has at least one non-config-
    package consumer.

Selector contract verified — config.Validate() rejects any value
other than ""/memory/postgres at startup with a clear error message.

Sprint 13.4 next (ARCH-H1 OpenAPI authoring batch 1) is on a
different axis; ARCH-M1 closure is complete with this commit
modulo the Sprint 13.7 audit-HTML flip + zero-floor pin.

Closes: ARCH-M1 substantive remediation. The cross-replica rate-
limit-cap-enforcement gap that the audit recommended deferring to
v3 is closed; operators with server.replicas > 1 flip
CERTCTL_RATE_LIMIT_BACKEND=postgres and get exactly-cap enforcement
across the cluster (proved by the multi-replica integration test now
gating CI).
2026-05-14 11:52:13 +00:00

366 lines
14 KiB
Go

// Copyright 2026 certctl LLC. All rights reserved.
// SPDX-License-Identifier: BUSL-1.1
// Package handler — Auth Bundle 2 Phase 7.5 / break-glass admin HTTP surface.
//
// 4 endpoints across two access levels:
//
// 1. Public (auth-bypass; the whole point is to log in WITHOUT
// existing creds):
// POST /auth/breakglass/login
// Rate-limited at 5/minute per source IP via the existing
// rate limiter middleware. When CERTCTL_BREAKGLASS_ENABLED=false,
// returns 404 (NOT 403) so the surface is invisible to scanners.
//
// 2. RBAC-gated (auth.breakglass.admin):
// POST /api/v1/auth/breakglass/credentials
// POST /api/v1/auth/breakglass/credentials/{actor_id}/unlock
// DELETE /api/v1/auth/breakglass/credentials/{actor_id}
//
// The handler delegates to internal/auth/breakglass.Service for the
// load-bearing logic (Argon2id hashing, lockout state machine,
// constant-time-compare, identical-shape errors). This file is purely
// HTTP shape — request-binding, status-code mapping, audit attribution
// for the caller-actor-id wire-up.
package handler
import (
"context"
"encoding/json"
"errors"
"net/http"
"strings"
"time"
"github.com/certctl-io/certctl/internal/auth/breakglass"
bgdomain "github.com/certctl-io/certctl/internal/auth/breakglass/domain"
sessiondomain "github.com/certctl-io/certctl/internal/auth/session/domain"
"github.com/certctl-io/certctl/internal/ratelimit"
)
// =============================================================================
// AuthBreakglassHandler.
// =============================================================================
// BreakglassService is the projection of *breakglass.Service the
// handler consumes. Defining the projection here keeps the handler
// stub-friendly + decoupled from the wider service surface.
type BreakglassService interface {
Enabled() bool
SetPassword(ctx context.Context, callerActorID, targetActorID, plaintext string) (*breakglass.SetPasswordResult, error)
Authenticate(ctx context.Context, actorID, plaintext, ip, userAgent string) (*breakglass.AuthenticateResult, error)
Unlock(ctx context.Context, callerActorID, targetActorID string) error
RemoveCredential(ctx context.Context, callerActorID, targetActorID string) error
List(ctx context.Context) ([]*bgdomain.BreakglassCredential, error)
}
// AuthBreakglassHandler ships the Phase 7.5 surface.
//
// Bundle 5 closure (S1): the docstring at the top of this file claimed
// the login endpoint was "Rate-limited at 5/minute per source IP via
// the existing rate limiter middleware" but no per-route limiter was
// wired — `/auth/breakglass/login` is registered via `r.mux.Handle`
// in router.go::AuthExemptRouterRoutes and bypasses the global RPS
// middleware that wraps `r.Register`-mounted routes. The login handler
// now owns its own SlidingWindowLimiter (5 attempts / minute / source
// IP, 50 000 key cap) so the documented behavior actually ships.
//
// Wired at startup via SetLoginRateLimiter (called from cmd/server/main.go
// alongside the other per-handler rate limiters that close audit
// findings H-9 / H-12 / Bundle 3 D7 / etc.). Defense-in-depth: even
// when the limiter is nil (legacy / test), the service-layer Argon2id
// lockout state machine still protects against brute force — but a
// nil limiter is a misconfiguration the integration test catches.
type AuthBreakglassHandler struct {
svc BreakglassService
cookieAttrs SessionCookieAttrs
// loginLimiter rate-limits POST /auth/breakglass/login by source IP.
// nil-safe: when unset, the handler skips the limiter check and
// relies on the service-layer Argon2id lockout. Production deploys
// MUST set this via SetLoginRateLimiter.
loginLimiter ratelimit.Limiter
}
// NewAuthBreakglassHandler constructs the handler.
func NewAuthBreakglassHandler(svc BreakglassService, cookieAttrs SessionCookieAttrs) *AuthBreakglassHandler {
return &AuthBreakglassHandler{svc: svc, cookieAttrs: cookieAttrs}
}
// SetLoginRateLimiter wires the per-source-IP rate limiter the Login
// handler enforces. Bundle 5 closure (S1) — see the AuthBreakglassHandler
// type docstring for the full rationale.
func (h *AuthBreakglassHandler) SetLoginRateLimiter(l ratelimit.Limiter) {
h.loginLimiter = l
}
// =============================================================================
// 1. Public login endpoint.
// =============================================================================
type breakglassLoginRequest struct {
ActorID string `json:"actor_id"`
Password string `json:"password"`
}
// Login handles POST /auth/breakglass/login.
//
// Auth-bypass — the whole point is to log in WITHOUT existing creds.
// When Service.Enabled() == false, returns 404 (NOT 403) so the surface
// is invisible to scanners. On success, sets the post-login session
// cookie + CSRF cookie + 204 No Content. On any failure (wrong password,
// locked account, no credential, unknown actor): uniform 401 + identical
// timing.
func (h *AuthBreakglassHandler) Login(w http.ResponseWriter, r *http.Request) {
if h.svc == nil || !h.svc.Enabled() {
// Surface invisibility — 404 (NOT 403) per Phase 7.5 spec.
http.NotFound(w, r)
return
}
var req breakglassLoginRequest
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
// Even invalid JSON returns 401 (identical to wrong-password) —
// no scanner-friendly 400 that distinguishes "wrong shape" vs
// "wrong password".
Error(w, http.StatusUnauthorized, "invalid credentials")
return
}
if strings.TrimSpace(req.ActorID) == "" || req.Password == "" {
Error(w, http.StatusUnauthorized, "invalid credentials")
return
}
ip := clientIPFromRequest(r)
// Bundle 5 closure (S1): per-source-IP rate limit. 5 attempts /
// minute / IP (default; configurable via the constructor at
// cmd/server/main.go). Returns 429 with no body so the response
// shape matches the rest of the auth surface (scanner-unfriendly).
// Audited by the service layer on the next attempt — we don't
// audit the rate-limit hit itself here because that would let an
// attacker flood the audit table with rate-limit rows from a
// single IP.
if h.loginLimiter != nil {
if err := h.loginLimiter.Allow(ip, time.Now()); err != nil {
Error(w, http.StatusTooManyRequests, "too many requests")
return
}
}
res, err := h.svc.Authenticate(r.Context(), req.ActorID, req.Password, ip, r.UserAgent())
if err != nil {
// All authenticate errors map to the SAME 401 + same body.
// The service has already audited the specific failure category.
Error(w, http.StatusUnauthorized, "invalid credentials")
return
}
// Set the post-login session cookie + CSRF cookie. Same attributes
// as the OIDC callback handler in auth_session_oidc.go; we
// duplicate the 8-line cookie-set block here so the break-glass
// handler doesn't import the OIDC handler package.
now := time.Now().UTC()
expires := now.Add(8 * time.Hour) // matches default SessionConfig.AbsoluteTimeout
http.SetCookie(w, &http.Cookie{
Name: sessiondomain.PostLoginCookieName,
Value: res.CookieValue,
Path: "/",
Expires: expires,
Secure: h.cookieAttrs.Secure,
HttpOnly: true,
SameSite: h.cookieAttrs.SameSite,
})
http.SetCookie(w, &http.Cookie{
Name: sessiondomain.CSRFCookieName,
Value: res.CSRFToken,
Path: "/",
Expires: expires,
Secure: h.cookieAttrs.Secure,
HttpOnly: false, // intentional — GUI must read it
SameSite: h.cookieAttrs.SameSite,
})
w.WriteHeader(http.StatusNoContent)
}
// =============================================================================
// 2. Admin endpoints.
// =============================================================================
type breakglassSetPasswordRequest struct {
ActorID string `json:"actor_id"`
Password string `json:"password"`
}
// SetPassword handles POST /api/v1/auth/breakglass/credentials.
// Permission: auth.breakglass.admin (gated at the router via rbacGate).
//
// When Service.Enabled() == false, returns 404 — admin endpoints share
// the surface-invisibility property with the login endpoint so an
// attacker probing for break-glass via the admin surface gets the same
// signal as probing the login endpoint.
func (h *AuthBreakglassHandler) SetPassword(w http.ResponseWriter, r *http.Request) {
if h.svc == nil || !h.svc.Enabled() {
http.NotFound(w, r)
return
}
caller, err := callerFromRequest(r)
if err != nil {
writeAuthError(w, err)
return
}
var req breakglassSetPasswordRequest
if derr := json.NewDecoder(r.Body).Decode(&req); derr != nil {
Error(w, http.StatusBadRequest, "invalid JSON body")
return
}
res, serr := h.svc.SetPassword(r.Context(), caller.ActorID, req.ActorID, req.Password)
if serr != nil {
switch {
case errors.Is(serr, breakglass.ErrWeakPassword):
Error(w, http.StatusBadRequest, "password fails strength requirements (min 12 bytes, max 256 bytes)")
case errors.Is(serr, breakglass.ErrUnauthenticated):
Error(w, http.StatusUnauthorized, "Authentication required")
case errors.Is(serr, breakglass.ErrDisabled):
http.NotFound(w, r)
default:
Error(w, http.StatusInternalServerError, "could not set password")
}
return
}
writeJSON(w, http.StatusCreated, map[string]interface{}{
"actor_id": res.ActorID,
"created_at": res.CreatedAt.Format(time.RFC3339),
})
}
// Unlock handles POST /api/v1/auth/breakglass/credentials/{actor_id}/unlock.
// Permission: auth.breakglass.admin.
func (h *AuthBreakglassHandler) Unlock(w http.ResponseWriter, r *http.Request) {
if h.svc == nil || !h.svc.Enabled() {
http.NotFound(w, r)
return
}
caller, err := callerFromRequest(r)
if err != nil {
writeAuthError(w, err)
return
}
targetID := r.PathValue("actor_id")
if targetID == "" {
Error(w, http.StatusBadRequest, "missing actor_id path param")
return
}
if uerr := h.svc.Unlock(r.Context(), caller.ActorID, targetID); uerr != nil {
switch {
case errors.Is(uerr, breakglass.ErrDisabled):
http.NotFound(w, r)
case errors.Is(uerr, breakglass.ErrUnauthenticated):
Error(w, http.StatusUnauthorized, "Authentication required")
default:
// repository.ErrBreakglassNotFound surfaces as a wrapped
// error here; we map to 404 via string match to avoid
// importing repository.
if strings.Contains(uerr.Error(), "not found") {
Error(w, http.StatusNotFound, "credential not found")
} else {
Error(w, http.StatusInternalServerError, "could not unlock credential")
}
}
return
}
w.WriteHeader(http.StatusNoContent)
}
// Remove handles DELETE /api/v1/auth/breakglass/credentials/{actor_id}.
// Permission: auth.breakglass.admin.
func (h *AuthBreakglassHandler) Remove(w http.ResponseWriter, r *http.Request) {
if h.svc == nil || !h.svc.Enabled() {
http.NotFound(w, r)
return
}
caller, err := callerFromRequest(r)
if err != nil {
writeAuthError(w, err)
return
}
targetID := r.PathValue("actor_id")
if targetID == "" {
Error(w, http.StatusBadRequest, "missing actor_id path param")
return
}
if rerr := h.svc.RemoveCredential(r.Context(), caller.ActorID, targetID); rerr != nil {
switch {
case errors.Is(rerr, breakglass.ErrDisabled):
http.NotFound(w, r)
case errors.Is(rerr, breakglass.ErrUnauthenticated):
Error(w, http.StatusUnauthorized, "Authentication required")
default:
if strings.Contains(rerr.Error(), "not found") {
Error(w, http.StatusNotFound, "credential not found")
} else {
Error(w, http.StatusInternalServerError, "could not remove credential")
}
}
return
}
w.WriteHeader(http.StatusNoContent)
}
// breakglassCredentialResponse is the wire shape returned by ListCredentials.
// Intentionally omits PasswordHash — the admin GUI only needs metadata to
// render the credentialed-actor table.
type breakglassCredentialResponse struct {
ActorID string `json:"actor_id"`
CreatedAt string `json:"created_at"`
LastPasswordChangeAt string `json:"last_password_change_at"`
FailureCount int `json:"failure_count"`
LockedUntil *string `json:"locked_until,omitempty"`
LastFailureAt *string `json:"last_failure_at,omitempty"`
}
type listBreakglassCredentialsResponse struct {
Credentials []breakglassCredentialResponse `json:"credentials"`
}
// ListCredentials handles GET /api/v1/auth/breakglass/credentials.
// Permission: auth.breakglass.admin.
//
// Audit 2026-05-10 CRIT-4 closure — backs the admin GUI Break-glass
// page. Returns 404 when CERTCTL_BREAKGLASS_ENABLED=false (surface
// invisibility, consistent with the other break-glass admin endpoints).
// The password hash is NEVER serialized to the wire.
func (h *AuthBreakglassHandler) ListCredentials(w http.ResponseWriter, r *http.Request) {
if h.svc == nil || !h.svc.Enabled() {
http.NotFound(w, r)
return
}
creds, err := h.svc.List(r.Context())
if err != nil {
if errors.Is(err, breakglass.ErrDisabled) {
http.NotFound(w, r)
return
}
Error(w, http.StatusInternalServerError, "could not list break-glass credentials")
return
}
resp := listBreakglassCredentialsResponse{Credentials: make([]breakglassCredentialResponse, 0, len(creds))}
for _, c := range creds {
row := breakglassCredentialResponse{
ActorID: c.ActorID,
CreatedAt: c.CreatedAt.UTC().Format(time.RFC3339),
LastPasswordChangeAt: c.LastPasswordChangeAt.UTC().Format(time.RFC3339),
FailureCount: c.FailureCount,
}
if c.LockedUntil != nil {
s := c.LockedUntil.UTC().Format(time.RFC3339)
row.LockedUntil = &s
}
if c.LastFailureAt != nil {
s := c.LastFailureAt.UTC().Format(time.RFC3339)
row.LastFailureAt = &s
}
resp.Credentials = append(resp.Credentials, row)
}
w.Header().Set("Content-Type", "application/json")
_ = json.NewEncoder(w).Encode(resp)
}