2 Commits

Author SHA1 Message Date
shankar0123 a41fc2d75c feat(ratelimit): Phase 13 Sprint 13.3 — wire backend selector + scheduler janitor + docs + helm (ARCH-M1 closure complete)
Phase 13 Sprint 13.3 — the completion half of the ARCH-M1
substantive close. Sprint 13.2 shipped the Postgres-backed
sliding-window limiter + multi-replica integration test; Sprint 13.3
wires the 6 call sites in cmd/server/main.go through the operator-
chosen backend selector, adds the rate_limit_buckets scheduler
janitor sweep, rewrites the observability doc, exposes the env-var
in the helm chart, and promotes the multi-replica integration test
to a required CI status check.

Signature ground-truth (sprint 13.2 + 13.3)
===========================================
Prompt-template signatures: `Allow(key string) error` and "5 call
sites." Actual repo: `Allow(key string, now time.Time) error` and 6
NewSlidingWindowLimiter call sites in cmd/server/main.go (the prompt
miscounted the second EST per-principal arm). Per CLAUDE.md "the repo
is truth," matched the live shape.

What changed
============

internal/config/server.go (+40 LOC):
  - Added `SlidingWindowBackend string` + `SlidingWindowJanitorInterval
    time.Duration` to RateLimitConfig with full operator-facing
    documentation of the two valid values (memory|postgres) +
    when-to-use-which decision tree.

internal/config/config.go (+27 LOC):
  - Load() reads CERTCTL_RATE_LIMIT_BACKEND (default "memory") +
    CERTCTL_RATE_LIMIT_JANITOR_INTERVAL (default 5m).
  - Validate() rejects anything other than ""/"memory"/"postgres"
    (empty = memory equivalence for test-built Configs that bypass
    Load()). Janitor interval must be ≥ 1 minute when set.
  - Failure modes return clear ::error:: with the env-var name + the
    valid values, so an operator typo ("postgress" → memory in a
    3-replica cluster) fails fast at startup.

internal/ratelimit/factory.go (NEW, 67 LOC):
  - NewLimiter(backend, db, maxN, window, mapCap) Limiter — single
    factory the 6 cmd/server/main.go call sites route through.
  - Drop-in signature: same maxN/window/mapCap as
    NewSlidingWindowLimiter (mapCap accepted + ignored for postgres
    — the rate_limit_buckets table grows until the janitor sweeps).
  - Defensive panic on unknown backend (config.Validate is SoT;
    this is belt-and-suspenders).

internal/ratelimit/postgres_gc.go (NEW, 73 LOC):
  - PostgresGC struct + NewPostgresGC + GarbageCollect.
  - Single-statement DELETE FROM rate_limit_buckets WHERE
    updated_at < NOW() - maxWindow. Idempotent.
  - maxWindow <= 0 is a no-op (operator opt-out).

internal/scheduler/scheduler.go (+90 LOC):
  - New RateLimitGarbageCollector interface (mirrors the
    ACMEGarbageCollector / SessionGarbageCollector contracts).
  - rateLimitGC field + rateLimitGCInterval + rateLimitGCRunning
    on Scheduler.
  - SetRateLimitGarbageCollector(gc) + SetRateLimitGCInterval(d)
    Setters following the existing acmeGC/sessionGC pattern.
  - rateLimitGCLoop() — JitteredTicker + atomic.Bool guard +
    per-tick context.WithTimeout(1m). Logs row count at Debug.
  - Loop counted in the Start() WaitGroup only when the GC is
    non-nil; cmd/server/main.go skips SetRateLimitGarbageCollector
    when backend=memory so the loop never launches for that case.

cmd/server/main.go (35 LOC diff):
  - All 6 ratelimit.NewSlidingWindowLimiter call sites now route
    through ratelimit.NewLimiter(cfg.RateLimit.SlidingWindowBackend,
    db, ...). Grep verification post-fix returns ZERO hits.
  - Six sites: breakglass loginLimiter (580), ocspLimiter (1003),
    exportLimiter (1068), EST failed-basic (1535), EST per-principal
    SCEP-mTLS arm (1591), EST per-principal SCEP arm (1613). The
    intune.NewPerDeviceRateLimiter site at line 1823 stays unmoved
    — its inner type-alias wrapper is the prompt's
    out-of-scope (cmd/server/*.go only).
  - Conditionally constructs PostgresGC + wires the scheduler janitor
    when backend=postgres; logs the wiring decision either way so
    operators see "rate-limit GC sweep enabled (postgres backend)"
    or "in-memory backend self-prunes" in the boot log.

internal/api/handler/{est,export,certificates,auth_breakglass}.go:
  - Replaced 5 *ratelimit.SlidingWindowLimiter field/Setter types
    with ratelimit.Limiter (the interface). Allow() satisfies the
    same call shape on both backends; the in-memory tests that
    construct *SlidingWindowLimiter still compile because the
    concrete type satisfies the interface (compile-time check in
    internal/ratelimit/limiter.go pins this).

docs/operator/observability.md (176 LOC diff):
  - Replaced the "per-process, in-memory, reset-on-restart, not
    shared across replicas" paragraph with the new
    configurable-backend section: operator decision tree,
    backend internals (memory vs postgres), janitor description,
    falsifiable closure proof (the Sprint 13.2 integration test
    name + invocation), helm chart wiring example.
  - Updated inventory to reflect the actual handler file paths +
    actual cap configurations (the prior doc said "60s window" for
    several limiters that actually use 60m / 24h windows).
  - Doc smoke confirmed: grep -c 'per-process, in-memory,
    reset-on-restart' docs/operator/observability.md = 0.

deploy/helm/certctl/values.yaml + templates/server-configmap.yaml +
templates/server-deployment.yaml:
  - Exposed server.rateLimiting.backend (default "memory") +
    server.rateLimiting.janitorInterval (default "5m") under the
    existing rateLimiting block.
  - ConfigMap renders both as rate-limit-backend +
    rate-limit-janitor-interval keys.
  - Deployment wires CERTCTL_RATE_LIMIT_BACKEND +
    CERTCTL_RATE_LIMIT_JANITOR_INTERVAL env vars from the configmap.
  - Helm render: `helm template deploy/helm/certctl --set
    server.rateLimiting.backend=postgres` shows the env-var on the
    server-deployment.yaml output.

.github/workflows/ci.yml (+12 LOC):
  - Added a new step in the Go Build & Test job that runs the
    Sprint 13.2 multi-replica integration test
    (TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas) with
    -tags=integration -race -timeout=300s. Fails the CI status check
    if the cross-replica row lock ever stops arbitrating across
    replicas — the ARCH-M1 closure regression gate.

Verification (all green locally; postgres integration via CI)
============================================================

  $ grep -nE 'NewSlidingWindowLimiter' cmd/server/*.go
    (zero hits — Sprint 13.3 receipt)

  $ go test -short -count=1 \
      ./internal/config/... ./internal/ratelimit/... \
      ./internal/scheduler/... ./internal/api/handler/... \
      ./cmd/server/...
    ok  internal/config       1.177s
    ok  internal/ratelimit    0.007s
    ok  internal/scheduler    9.165s
    ok  internal/api/handler  6.245s
    ok  cmd/server            0.390s

  $ staticcheck ./internal/ratelimit/... ./internal/scheduler/... \
      ./internal/config/... ./internal/api/handler/... ./cmd/server/...
    (clean)

  $ gofmt -l internal/ cmd/server/
    (clean)

  $ grep -c 'per-process, in-memory, reset-on-restart' \
      docs/operator/observability.md
    0   (doc smoke — the audit's verbatim phrasing is gone)

  $ bash scripts/ci-guards/G-3-env-docs-drift.sh
    G-3 env-docs-drift: clean.

  $ bash scripts/ci-guards/complete-path-config-coverage.sh
    OK — every CERTCTL_* env var (197) has at least one non-config-
    package consumer.

Selector contract verified — config.Validate() rejects any value
other than ""/memory/postgres at startup with a clear error message.

Sprint 13.4 next (ARCH-H1 OpenAPI authoring batch 1) is on a
different axis; ARCH-M1 closure is complete with this commit
modulo the Sprint 13.7 audit-HTML flip + zero-floor pin.

Closes: ARCH-M1 substantive remediation. The cross-replica rate-
limit-cap-enforcement gap that the audit recommended deferring to
v3 is closed; operators with server.replicas > 1 flip
CERTCTL_RATE_LIMIT_BACKEND=postgres and get exactly-cap enforcement
across the cluster (proved by the multi-replica integration test now
gating CI).
2026-05-14 11:52:13 +00:00
shankar0123 aaddd31d20 refactor(config): extract Server family + isLoopbackAddr helper (Phase 9, 6 of N)
Continuing Phase 9 ARCH-M2 closure. Sprint 6 groups the server-tier
infrastructure structs (the things that configure HOW the server
runs) and the HIGH-12 demo-mode startup-guard helper that exclusively
serves the ServerConfig.Host gate.

What moved
==========
  internal/config/server.go (new, 374 lines including BSL header +
                            Phase 9 doc-comment + 2 imports +
                            7 structs + 1 unexported helper)

Seven structs:
  - ServerConfig       (HTTP listener: Host, Port, MaxBodySize,
                        TLS sub-struct, AuditFlushTimeoutSeconds)
  - ServerTLSConfig    (HTTPS-only TLS material: CertPath + KeyPath)
  - DatabaseConfig     (URL + MaxConnections + MigrationsPath +
                        DemoSeed)
  - SchedulerConfig    (all 15 scheduler-loop tunables: RenewalCheck,
                        JobProcessor, RenewalConcurrency, agent-health,
                        notification-process + retry, retry-interval,
                        job-timeout, AwaitingCSR + Approval timeouts,
                        short-lived-expiry, CRL-generation, OCSP-rate-
                        limit, cert-export-rate-limit, deploy-backup-
                        retention, K8s-kubelet-sync-timeout)
  - LogConfig          (Level + Format)
  - RateLimitConfig    (Enabled + RPS + BurstSize + per-user
                        overrides)
  - CORSConfig         (AllowedOrigins — empty deny-by-default)

One unexported helper:
  - isLoopbackAddr()   (HIGH-12 demo-mode guard: 127.0.0.1, ::1,
                        and "localhost" return true; 0.0.0.0, ::,
                        and non-localhost hostnames return false.
                        Same-package callers: Validate() in config.go
                        + isLoopbackAddr_test in config_test.go,
                        both unaffected by the move.)

Three sed passes (highest line numbers first so positions don't shift)
======================================================================
The edit was performed via three independent sed deletes from
highest-line to lowest-line so each delete's range references the
file's pre-shift line numbers:

  1. sed -i '1924,1963d'  — deleted isLoopbackAddr (40 lines)
  2. sed -i '834,893d'    — deleted LogConfig + RateLimitConfig +
                            CORSConfig (60 lines)
  3. sed -i '624,810d'    — deleted ServerConfig + ServerTLSConfig +
                            DatabaseConfig + SchedulerConfig
                            (187 lines)

Total: 287 lines deleted. Reverse-order matters because each delete
shifts subsequent line numbers; doing them top-down would require
re-deriving every range mid-edit.

Why ApprovalConfig stayed in config.go
=======================================
ApprovalConfig (RBAC-related — issuance-approval workflow) sits
between SchedulerConfig and LogConfig in the original file ordering.
It's NOT server-tier infrastructure — it belongs with the Auth/RBAC
surface. Sprint 6's sed ranges deliberately preserve it where it
lives. Operator may want to fold it into a future Auth-followup cut
if the approval surface needs to live adjacent to AuthConfig.

Import-graph hygiene
====================
isLoopbackAddr was the ONLY user of `net` in config.go (verified via
`grep -nE '\bnet\.' internal/config/config.go` → 2 hits, both inside
isLoopbackAddr's body). After the move, config.go's `net` import
becomes unused — would have failed `go vet`. This commit removes the
`net` line from config.go's import block. server.go imports `net`
directly. The `time` import in config.go stays because the still-
in-place OCSPResponderConfig / DigestConfig / HealthCheckConfig /
NetworkScanConfig / VerificationConfig / per-vendor-issuer configs
all reference `time.Duration`.

Public-surface invariant
========================
Every type, exported field, and doc-comment is byte-identical to
pre-split. Package stays `config`. Every external caller of
`config.ServerConfig` / `config.ServerTLSConfig` / `config.DatabaseConfig`
/ `config.SchedulerConfig` / `config.LogConfig` / `config.RateLimitConfig`
/ `config.CORSConfig` resolves the same way. The unexported
isLoopbackAddr is invisible to external consumers; its same-package
callers (Validate, the test) continue to resolve via the package
symbol table.

Verification (all clean):
  gofmt -l internal/config/                  → clean
  go build ./internal/config/...             → clean
  go test ./internal/config/... -count=1     → ok (0.68s)
  staticcheck ./internal/config/...          → clean
  go build ./cmd/server/...
          ./internal/auth/...
          ./internal/api/router/...
          ./internal/api/handler/...
          ./internal/scheduler/...           → clean (the critical
                                              broader-importer check)
  grep -nE '^type (ServerConfig|ServerTLSConfig|DatabaseConfig|SchedulerConfig|LogConfig|RateLimitConfig|CORSConfig)|^func isLoopbackAddr' internal/config/config.go
    → empty (none remain in config.go)
  grep -nE '^type (ServerConfig|ServerTLSConfig|DatabaseConfig|SchedulerConfig|LogConfig|RateLimitConfig|CORSConfig)|^func isLoopbackAddr' internal/config/server.go
    → 7 types + 1 func (correct)
  grep -nE '\bnet\.' internal/config/config.go
    → empty (the import-removal was load-bearing)

LOC delta:
  config.go:  1963 → 1673  (-290 lines: -287 from three sed cuts,
                                        -1 from import-block
                                          line removal,
                                        -2 from misc gofmt cleanup)
  server.go:  new, 374 lines (incl. 87-line Phase 9 doc-comment +
                              BSL header + package decl + 2 imports)

Cumulative Phase 9 progress (Sprints 1+2+3+4+5+6 from config.go):
  Pre-Phase-9:                3403 LOC
  After Sprint 1 (Notifier):  3335 LOC  (-68)
  After Sprint 2 (ACME):      3108 LOC  (-227)
  After Sprint 3 (SCEP):      2774 LOC  (-334)
  After Sprint 4 (EST):       2467 LOC  (-307)
  After Sprint 5 (Auth):      1963 LOC  (-504)
  After Sprint 6 (Server):    1673 LOC  (-290)
  Total Sprint 1+2+3+4+5+6:  -1730 LOC  (-50.8%)

Notable milestone: config.go has now lost MORE than HALF its original
lines (-50.8%). One more cut from config.go remains (Sprint 7 ~600
LOC of per-vendor issuer configs) before the file split moves on to
non-config hotspots (Sprints 8-12).

Pattern lesson — import-graph cleanup
======================================
Splits that move the LAST consumer of an import need to remove the
import from the source file or `go vet` / build will fail. The check
is `grep -nE '\bnet\.' internal/config/config.go` (or whichever
package) before commit — if empty, drop the import line. Past
sprints didn't hit this because the moved-out helpers used only
shared packages (`strings`, `os`, `fmt`, `time`) that other code in
config.go still uses. Sprint 6's `net` removal is the first
import-rebalancing in Phase 9.

Three-pass sed pattern (also new in Sprint 6)
=============================================
Prior sprints did one or two sed deletes. Sprint 6 needed three
because the Server-family structs straddled ApprovalConfig and
isLoopbackAddr lived far from the struct block. Doing them
highest-line-first means each range references pre-shift line
numbers — no mid-edit re-derivation required.

Next queued (Sprint 7): Issuers family from config.go →
internal/config/issuers.go (~600 LOC). Includes KeygenConfig +
CAConfig + the ten per-vendor configs (StepCA, Vault, DigiCert,
Sectigo, GoogleCAS, AWSACMPCA, Entrust, GlobalSign, EJBCA, OpenSSL).
This is the LAST config.go cut of Phase 9; after Sprint 7 ships,
config.go should drop to ~1100-1200 LOC and the remaining splits
target non-config hotspots (cmd/server/main.go, service/acme.go,
mcp/tools.go, auth_session_oidc.go, cmd/agent/main.go).

Closes: cowork/certctl-architecture-diligence-audit.html#fix-ARCH-M2
        (partial — 6 of 12 — full ARCH-M2 closure is the aggregate)
2026-05-14 04:45:16 +00:00