certctl/docs/operator/observability.md
shankar0123 a41fc2d75c feat(ratelimit): Phase 13 Sprint 13.3 — wire backend selector + scheduler janitor + docs + helm (ARCH-M1 closure complete)
Phase 13 Sprint 13.3 — the completion half of the ARCH-M1
substantive close. Sprint 13.2 shipped the Postgres-backed
sliding-window limiter + multi-replica integration test; Sprint 13.3
wires the 6 call sites in cmd/server/main.go through the operator-
chosen backend selector, adds the rate_limit_buckets scheduler
janitor sweep, rewrites the observability doc, exposes the env-var
in the helm chart, and promotes the multi-replica integration test
to a required CI status check.

Signature ground-truth (sprint 13.2 + 13.3)
===========================================
Prompt-template signatures: `Allow(key string) error` and "5 call
sites." Actual repo: `Allow(key string, now time.Time) error` and 6
NewSlidingWindowLimiter call sites in cmd/server/main.go (the prompt
miscounted the second EST per-principal arm). Per CLAUDE.md "the repo
is truth," matched the live shape.

What changed
============

internal/config/server.go (+40 LOC):
  - Added `SlidingWindowBackend string` + `SlidingWindowJanitorInterval
    time.Duration` to RateLimitConfig with full operator-facing
    documentation of the two valid values (memory|postgres) +
    when-to-use-which decision tree.

internal/config/config.go (+27 LOC):
  - Load() reads CERTCTL_RATE_LIMIT_BACKEND (default "memory") +
    CERTCTL_RATE_LIMIT_JANITOR_INTERVAL (default 5m).
  - Validate() rejects anything other than ""/"memory"/"postgres"
    (empty = memory equivalence for test-built Configs that bypass
    Load()). Janitor interval must be ≥ 1 minute when set.
  - Failure modes return clear ::error:: with the env-var name + the
    valid values, so an operator typo ("postgress" → memory in a
    3-replica cluster) fails fast at startup.

internal/ratelimit/factory.go (NEW, 67 LOC):
  - NewLimiter(backend, db, maxN, window, mapCap) Limiter — single
    factory the 6 cmd/server/main.go call sites route through.
  - Drop-in signature: same maxN/window/mapCap as
    NewSlidingWindowLimiter (mapCap accepted + ignored for postgres
    — the rate_limit_buckets table grows until the janitor sweeps).
  - Defensive panic on unknown backend (config.Validate is SoT;
    this is belt-and-suspenders).

internal/ratelimit/postgres_gc.go (NEW, 73 LOC):
  - PostgresGC struct + NewPostgresGC + GarbageCollect.
  - Single-statement DELETE FROM rate_limit_buckets WHERE
    updated_at < NOW() - maxWindow. Idempotent.
  - maxWindow <= 0 is a no-op (operator opt-out).

internal/scheduler/scheduler.go (+90 LOC):
  - New RateLimitGarbageCollector interface (mirrors the
    ACMEGarbageCollector / SessionGarbageCollector contracts).
  - rateLimitGC field + rateLimitGCInterval + rateLimitGCRunning
    on Scheduler.
  - SetRateLimitGarbageCollector(gc) + SetRateLimitGCInterval(d)
    setters following the existing acmeGC/sessionGC pattern.
  - rateLimitGCLoop() — JitteredTicker + atomic.Bool guard +
    per-tick context.WithTimeout(1m). Logs row count at Debug.
  - Loop counted in the Start() WaitGroup only when the GC is
    non-nil; cmd/server/main.go skips SetRateLimitGarbageCollector
    when backend=memory so the loop never launches for that case.

cmd/server/main.go (35 LOC diff):
  - All 6 ratelimit.NewSlidingWindowLimiter call sites now route
    through ratelimit.NewLimiter(cfg.RateLimit.SlidingWindowBackend,
    db, ...). Grep verification post-fix returns ZERO hits.
  - Six sites: breakglass loginLimiter (580), ocspLimiter (1003),
    exportLimiter (1068), EST failed-basic (1535), EST per-principal
    SCEP-mTLS arm (1591), EST per-principal SCEP arm (1613). The
    intune.NewPerDeviceRateLimiter site at line 1823 is left unchanged:
    its inner type-alias wrapper falls outside the prompt's scope
    (cmd/server/*.go only).
  - Conditionally constructs PostgresGC + wires the scheduler janitor
    when backend=postgres; logs the wiring decision either way so
    operators see "rate-limit GC sweep enabled (postgres backend)"
    or "in-memory backend self-prunes" in the boot log.

internal/api/handler/{est,export,certificates,auth_breakglass}.go:
  - Replaced 5 *ratelimit.SlidingWindowLimiter field/Setter types
    with ratelimit.Limiter (the interface). Allow() satisfies the
    same call shape on both backends; the in-memory tests that
    construct *SlidingWindowLimiter still compile because the
    concrete type satisfies the interface (compile-time check in
    internal/ratelimit/limiter.go pins this).

docs/operator/observability.md (176 LOC diff):
  - Replaced the "per-process, in-memory, reset-on-restart, not
    shared across replicas" paragraph with the new
    configurable-backend section: operator decision tree,
    backend internals (memory vs postgres), janitor description,
    falsifiable closure proof (the Sprint 13.2 integration test
    name + invocation), helm chart wiring example.
  - Updated inventory to reflect the actual handler file paths +
    actual cap configurations (the prior doc said "60s window" for
    several limiters that actually use 60m / 24h windows).
  - Doc smoke confirmed: grep -c 'per-process, in-memory,
    reset-on-restart' docs/operator/observability.md = 0.

deploy/helm/certctl/values.yaml + templates/server-configmap.yaml +
templates/server-deployment.yaml:
  - Exposed server.rateLimiting.backend (default "memory") +
    server.rateLimiting.janitorInterval (default "5m") under the
    existing rateLimiting block.
  - ConfigMap renders both as rate-limit-backend +
    rate-limit-janitor-interval keys.
  - Deployment wires CERTCTL_RATE_LIMIT_BACKEND +
    CERTCTL_RATE_LIMIT_JANITOR_INTERVAL env vars from the configmap.
  - Helm render: `helm template deploy/helm/certctl --set
    server.rateLimiting.backend=postgres` shows the env-var on the
    server-deployment.yaml output.

.github/workflows/ci.yml (+12 LOC):
  - Added a new step in the Go Build & Test job that runs the
    Sprint 13.2 multi-replica integration test
    (TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas) with
    -tags=integration -race -timeout=300s. Fails the CI status check
    if the cross-replica row lock ever stops arbitrating across
    replicas — the ARCH-M1 closure regression gate.

Verification (all green locally; postgres integration via CI)
============================================================

  $ grep -nE 'NewSlidingWindowLimiter' cmd/server/*.go
    (zero hits — Sprint 13.3 receipt)

  $ go test -short -count=1 \
      ./internal/config/... ./internal/ratelimit/... \
      ./internal/scheduler/... ./internal/api/handler/... \
      ./cmd/server/...
    ok  internal/config       1.177s
    ok  internal/ratelimit    0.007s
    ok  internal/scheduler    9.165s
    ok  internal/api/handler  6.245s
    ok  cmd/server            0.390s

  $ staticcheck ./internal/ratelimit/... ./internal/scheduler/... \
      ./internal/config/... ./internal/api/handler/... ./cmd/server/...
    (clean)

  $ gofmt -l internal/ cmd/server/
    (clean)

  $ grep -c 'per-process, in-memory, reset-on-restart' \
      docs/operator/observability.md
    0   (doc smoke — the audit's verbatim phrasing is gone)

  $ bash scripts/ci-guards/G-3-env-docs-drift.sh
    G-3 env-docs-drift: clean.

  $ bash scripts/ci-guards/complete-path-config-coverage.sh
    OK — every CERTCTL_* env var (197) has at least one non-config-
    package consumer.

Selector contract verified — config.Validate() rejects any value
other than ""/memory/postgres at startup with a clear error message.

Sprint 13.4 next (ARCH-H1 OpenAPI authoring batch 1) is on a
different axis; ARCH-M1 closure is complete with this commit
modulo the Sprint 13.7 audit-HTML flip + zero-floor pin.

Closes: ARCH-M1 substantive remediation. The cross-replica rate-
limit-cap-enforcement gap that the audit recommended deferring to
v3 is closed; operators with server.replicas > 1 flip
CERTCTL_RATE_LIMIT_BACKEND=postgres and get exactly-cap enforcement
across the cluster (proved by the multi-replica integration test now
gating CI).
2026-05-14 11:52:13 +00:00


Observability — what certctl emits, what it doesn't, and what survives a restart

Last reviewed: 2026-05-13

Use this when:

  • You're sizing certctl's observability surface against your existing metrics + tracing + logging stack and want to know exactly what drops in cleanly and what gaps you'll need to bridge.
  • You're investigating a "weird metric" or planning a Grafana dashboard and need the canonical list of what's exposed.
  • You're running multi-replica or restarting frequently and need to understand which counters reset.

certctl's observability posture is deliberately minimal-but-honest: ship the surfaces an operator actually needs to wire into a Prometheus + Grafana + Loki stack, and don't make claims the implementation can't back. This document is the canonical statement of what's emitted, what's deferred, and why.

Metrics — what's emitted

certctl exposes metrics through two endpoints on the control plane:

| Endpoint | Content-Type | Audience |
| --- | --- | --- |
| GET /api/v1/metrics | application/json | Dashboards that prefer JSON, ad-hoc curl |
| GET /api/v1/metrics/prometheus | text/plain; version=0.0.4; charset=utf-8 (Prometheus exposition) | Prometheus, Grafana Agent, Datadog Agent, Victoria Metrics, any OpenMetrics-compatible scraper |

The Prometheus endpoint emits standard # HELP / # TYPE / metric lines following the conventions at prometheus.io/docs/instrumenting/exposition_formats. Metric names are lowercase, snake_case, and prefixed with certctl_.

The implementation is at internal/api/handler/metrics.go.

What's covered

Run the endpoint against a live deployment for the authoritative list (it expands as the service ships more metrics). At time of writing the exposition includes:

  • Certificate-inventory gauges: certctl_certificate_total, certctl_certificate_active, certctl_certificate_expiring_soon, certctl_certificate_expired, certctl_certificate_revoked.
  • Per-issuer-type issuance histograms: certctl_issuance_duration_seconds{issuer_type=…} (the 2026-05-01 issuer-coverage audit closure #4 — this is the load-bearing metric for per-issuer SLOs).
  • Server uptime: certctl_uptime_seconds.

Prometheus library vs hand-rolled exposition (acquisition diligence)

certctl writes Prometheus exposition format with fmt.Fprintf from the metrics handler, not via the github.com/prometheus/client_golang library. This is intentional for v2.x:

  • The metric surface is shallow (gauges + a handful of histograms with static labels). The client library's value is on the registration + thread-safe accumulation side, neither of which is load-bearing for the current surface.
  • The exposition output is pinned to the spec version explicitly (version=0.0.4) and is unit-tested against expected output at internal/api/handler/stats_handler_test.go.
  • Swapping in client_golang is a mechanical migration when the metric surface grows (per-connector counters + RED-method histograms on every handler are the natural next surface), but it has no operator-visible behavior change today.

The migration is tracked in WORKSPACE-ROADMAP.md as a v3 item. If you're an acquirer reading this: the question to ask is "does the metric surface meet our SLO needs today", not "is the right library under the hood." If the answer to the first question is yes, the second is a refactor, not a feature gap.
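
To make the hand-rolled approach concrete, here is a minimal sketch of writing exposition format 0.0.4 with fmt.Fprintf. It is not the code in internal/api/handler/metrics.go: the `gauge` type, the `renderGauges` helper, and the HELP strings are assumptions made for illustration, and HELP-text escaping is omitted.

```go
package main

import (
	"fmt"
	"strings"
)

// gauge is a hypothetical in-memory snapshot of one metric.
type gauge struct {
	Name  string
	Help  string
	Value float64
}

// renderGauges writes each gauge in Prometheus text exposition format
// (version 0.0.4): a HELP line, a TYPE line, then the sample itself.
func renderGauges(gauges []gauge) string {
	var b strings.Builder
	for _, g := range gauges {
		fmt.Fprintf(&b, "# HELP %s %s\n", g.Name, g.Help)
		fmt.Fprintf(&b, "# TYPE %s gauge\n", g.Name)
		fmt.Fprintf(&b, "%s %g\n", g.Name, g.Value)
	}
	return b.String()
}

func main() {
	// Metric names come from the list above; help text and values are made up.
	fmt.Print(renderGauges([]gauge{
		{Name: "certctl_certificate_total", Help: "Total certificates in inventory.", Value: 42},
		{Name: "certctl_uptime_seconds", Help: "Seconds since the server started.", Value: 90.5},
	}))
}
```

A real handler would additionally set the Content-Type header to the value shown in the endpoint table before writing the body.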

Tracing — explicitly not yet shipped

certctl does not ship distributed tracing instrumentation today:

  • No OpenTelemetry SDK setup in cmd/server/main.go.
  • No OTLP exporter wired into outbound calls (issuer connectors, agent enrollment, etc.).
  • The go.opentelemetry.io/otel packages that appear in go.mod are indirect-only — they're transitive dependencies of coreos/go-oidc and similar.

This is honest: there is no in-process tracing surface to monitor, correlate, or sample. If your environment requires end-to-end traces across the certctl control plane + agents + issuer backends, this is a gap you would close on the certctl side as part of a v3 work item. Until then:

  • Structured logs include a request_id you can correlate across the server log stream. See internal/api/middleware/request_id.go.
  • The Prometheus histogram certctl_issuance_duration_seconds{issuer_type=…} carries the same per-issuer latency signal a trace span would, just without the per-request fan-out.

OpenTelemetry instrumentation is tracked in WORKSPACE-ROADMAP.md as a v3 item.

Logging

certctl emits structured JSON logs to stdout via the stdlib log/slog package. Every line carries time, level, msg, and — where relevant — request_id, actor_id, and a contextual subject (certificate_id, issuer_id, agent_id, etc.).

Log level is controlled by CERTCTL_LOG_LEVEL (debug / info / warn / error); defaults to info. There is no in-process log ingest — operators are expected to collect from container stdout into their existing log pipeline (Loki, CloudWatch Logs, Datadog, ELK, Splunk, etc.).

No log line contains private-key material, bearer tokens, OIDC client secrets, or session cookies. The break-glass login path explicitly scrubs the password before it reaches the audit subsystem (see docs/operator/auth-threat-model.md § "Break-glass token leak").

Rate-limit behavior — configurable backend (memory or postgres)

The sliding-window-log rate limiters used across certctl's authenticated-but-shared-credential code paths (break-glass login, OCSP per-IP, cert-export per-actor, EST per-principal, EST failed-basic source-IP) carry a configurable backend. The operator picks between two implementations via CERTCTL_RATE_LIMIT_BACKEND:

| Value | When to use |
| --- | --- |
| memory | Default. Single-replica deploys; sketchpad / dev. |
| postgres | HA deploys (server.replicas > 1). Cross-replica-consistent. |

Phase 13 Sprint 13.2/13.3 (architecture diligence audit ARCH-M1 closure) replaced the prior single-process limitation with a substantive close: when the operator opts into postgres, all replicas share the same rate_limit_buckets table (migration 000046) and per-key access is arbitrated via SELECT FOR UPDATE row locks. A 3-replica cluster hitting one rate-limited endpoint concurrently sees exactly the configured cap succeed across the cluster — not 3× the cap as the old per-process backend would have allowed.

Operator decision tree

Single replica (server.replicas = 1, the helm chart default)?
  └─ Use CERTCTL_RATE_LIMIT_BACKEND=memory (the default; no action
     required). Bucket lookups stay in-process; zero DB round-trips
     on the hot path.

Two or more replicas?
  └─ Use CERTCTL_RATE_LIMIT_BACKEND=postgres. Two extra DB round-trips
     per Allow call (BEGIN ... SELECT FOR UPDATE ... UPDATE ... COMMIT);
     acceptable on the gated hot path. The Sprint 13.2 multi-replica
     integration test pins exactly-cap enforcement across N replicas
     as the closure proof.

Inventory

| Limiter | Scope | Window | Cap |
| --- | --- | --- | --- |
| Break-glass login (per source-IP) | internal/api/handler/auth_breakglass.go | 60s | 5 attempts |
| OCSP query (per source-IP) | internal/api/handler/certificates.go | 60s | configurable (CERTCTL_OCSP_RATE_LIMIT_PER_IP_MIN) |
| Cert export (per actor) | internal/api/handler/export.go | 1h | configurable (CERTCTL_CERT_EXPORT_RATE_LIMIT_PER_ACTOR_HR) |
| EST per-principal CSR enrollment | internal/api/handler/est.go | 24h | configurable (per-profile RateLimitPerPrincipal24h) |
| EST HTTP-Basic source-IP failed-auth | internal/api/handler/est.go | 60m | 10 attempts |
| SCEP/Intune per-device challenge | internal/scep/intune/ | 60s | configurable (*_PER_MINUTE) |
| ACME per-account orders / key-change / challenge-respond | internal/service/acme.go | 1h | configurable |

The CERTCTL_RATE_LIMIT_BACKEND selector applies to the first five (the cmd/server-wired limiters). The SCEP/Intune wrapper + the ACME per-account limiter ride their own internal accounting today; both are tracked as follow-ups in WORKSPACE-ROADMAP.md.

Backend internals

Both backends share the algorithm: sliding-window log + per-key bucket + prune-on-Allow.

Memory backend (memory): a per-process map keyed by bucket key, guarded by a mutex. A package-level LRU cap prevents unbounded growth under adversarial key cardinality (default 100,000 keys per limiter instance; under pressure, the bucket whose newest timestamp is oldest is evicted first). Implemented at internal/ratelimit/sliding_window.go.

Postgres backend (postgres) — same algorithm against the rate_limit_buckets table:

CREATE TABLE rate_limit_buckets (
    bucket_key TEXT          PRIMARY KEY,
    timestamps TIMESTAMPTZ[] NOT NULL DEFAULT '{}',
    updated_at TIMESTAMPTZ   NOT NULL DEFAULT NOW()
);

Allow(key, now) opens a transaction, ensures the row exists (INSERT ... ON CONFLICT DO NOTHING), acquires the row lock (SELECT ... FOR UPDATE), prunes timestamps older than now-window, compares the post-prune count against maxN, conditionally appends now, persists, and commits. The row lock is what arbitrates across replicas: replicas A and B firing simultaneous Allow("k") never race because Postgres serializes the per-key row update across the cluster. Implemented at internal/ratelimit/postgres_sliding_window.go.
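
The transaction steps above might look roughly like this with database/sql. This is a sketch under stated assumptions, not the code in postgres_sliding_window.go: it assumes a Postgres driver that accepts `$n` placeholders and time.Time arguments, it prunes in SQL to avoid driver-specific array scanning, and `allowPostgres`, `errRateLimited`, and the query constants are hypothetical names.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"time"
)

var errRateLimited = errors.New("rate limited") // stand-in error value

const (
	ensureRow = `INSERT INTO rate_limit_buckets (bucket_key) VALUES ($1)
ON CONFLICT (bucket_key) DO NOTHING`
	lockAndCount = `SELECT cardinality(ARRAY(
  SELECT t FROM unnest(timestamps) AS t WHERE t > $2::timestamptz))
FROM rate_limit_buckets WHERE bucket_key = $1 FOR UPDATE`
	persistAllowed = `UPDATE rate_limit_buckets
SET timestamps = array_append(ARRAY(
      SELECT t FROM unnest(timestamps) AS t WHERE t > $2::timestamptz),
      $3::timestamptz),
    updated_at = $3::timestamptz
WHERE bucket_key = $1`
	persistDenied = `UPDATE rate_limit_buckets
SET timestamps = ARRAY(
      SELECT t FROM unnest(timestamps) AS t WHERE t > $2::timestamptz),
    updated_at = $3::timestamptz
WHERE bucket_key = $1`
)

func allowPostgres(ctx context.Context, db *sql.DB, key string, now time.Time, maxN int, window time.Duration) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // harmless after a successful Commit

	// Ensure the row exists so there is something to lock.
	if _, err := tx.ExecContext(ctx, ensureRow, key); err != nil {
		return err
	}
	cutoff := now.Add(-window)

	// Take the row lock and count the timestamps still inside the window.
	var live int
	if err := tx.QueryRowContext(ctx, lockAndCount, key, cutoff).Scan(&live); err != nil {
		return err
	}

	// Persist the pruned log, appending now only when under the cap.
	allowed := live < maxN
	query := persistDenied
	if allowed {
		query = persistAllowed
	}
	if _, err := tx.ExecContext(ctx, query, key, cutoff, now); err != nil {
		return err
	}
	if err := tx.Commit(); err != nil {
		return err
	}
	if !allowed {
		return errRateLimited
	}
	return nil
}

func main() {
	// Running allowPostgres needs a live database and a registered driver,
	// so this sketch only documents the call shape.
	fmt.Println("usage: allowPostgres(ctx, db, key, time.Now(), maxN, window)")
}
```

A second replica calling the same function for the same key blocks at the `FOR UPDATE` statement until the first transaction commits, which is the serialization point the paragraph above describes.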

Janitor sweep (postgres backend only)

The scheduler runs a rate_limit_buckets janitor every CERTCTL_RATE_LIMIT_JANITOR_INTERVAL (default 5m, minimum 1m). The sweep deletes rows whose updated_at is older than the longest configured window any limiter uses (24h today, matching the EST per-principal limiter). The sweep is idempotent: an immediate repeat finds zero rows. The memory backend's prune-on-Allow path keeps buckets short-lived without a separate sweep, so the janitor loop is not started when backend=memory.

Falsifiable closure proof

The Phase 13 Sprint 13.2 integration test internal/integration/ratelimit_multi_replica_test.go (//go:build integration) fires 100 concurrent Allow("test-key") calls round-robined across 3 independent PostgresSlidingWindowLimiter instances sharing one Postgres database (cap=10, window=1m) and asserts exactly 10 succeed + 90 return ErrRateLimited. If the cross-replica row lock weren't arbitrating, concurrent read-modify-write cycles on the same row could overwrite each other's appends and more than 10 calls would succeed; three fully independent per-process limiters, each seeing about 33 calls, would admit 30 in total. Re-run:

go test -tags=integration -count=1 \
    -run TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas \
    ./internal/integration/...

Helm chart wiring

The helm chart at deploy/helm/certctl/ exposes the backend via server.rateLimiting.backend (default memory). To opt into the postgres backend for an HA deploy:

helm upgrade --install certctl deploy/helm/certctl \
    --set server.replicas=3 \
    --set server.rateLimiting.backend=postgres \
    --set server.rateLimiting.janitorInterval=5m

server.replicas > 1 without flipping backend to postgres still works, because the limits stay per-process, but the effective cap becomes 2× / 3× / N× the configured value depending on replica count. The chart does NOT auto-flip on replicas > 1 because some HA deploys deliberately want per-process limits (sticky-session ingress + tight per-replica caps to detect bot traffic at the edge before it hits the application).

Where these numbers live

The configurable caps are exposed as CERTCTL_*_PER_MINUTE / CERTCTL_ACME_*_PER_HOUR env vars — see the security posture doc for the operator-facing configuration surface. The hard-coded ones (break-glass 5/min) are intentionally non-configurable as a defense-in-depth measure; the auth subsystem owns that policy decision.

Performance harness scope

The load-test harness at deploy/test/loadtest/ covers the API-tier hot paths (issuance acceptance + cert list). It does NOT load-test issuer-connector round-trips (you'd be load-testing someone else's API), full multi-RTT ACME enrollment flows, bulk-revoke / bulk-renew admin paths, or scheduler concurrency under bulk renewal. Each exclusion is justified in deploy/test/loadtest/README.md under "What it explicitly does NOT measure." If your evaluation requires a benchmark on one of those exclusions, the right next step is a follow-up scenario in that directory.

The per-component benchmarks ship in-tree as Go Benchmark* functions:

  • internal/auth/session/bench_test.go — session signing + validation steady state and cold-process timing.
  • internal/auth/oidc/bench_test.go — OIDC verify steady state.
  • internal/auth/oidc/bench_keycloak_test.go — OIDC cold-cache timing (gated //go:build integration).

Authoritative benchmark numbers + threshold contracts: docs/operator/auth-benchmarks.md (auth subsystem) and docs/operator/performance-baselines.md (general API tier).