Sprint 6 ACQ DEPL-006 closure follow-up. The G-3-env-docs-drift
ci-guard scans `internal/` + `cmd/` for every CERTCTL_*
env-var reference and cross-checks against README + docs/ +
deploy/helm/ + deploy/ENVIRONMENTS.md. The OTel-seed commit
(35277c0) introduced `CERTCTL_OTEL_ENABLED` in
`internal/config/config.go` + `cmd/server/main.go` but didn't
add the matching doc entry, so the guard caught the drift on
the next CI run with:
G-3 regression: env var(s) defined in Go source but never documented:
CERTCTL_OTEL_ENABLED
Replaces the existing "Tracing — explicitly not yet shipped"
subsection in docs/operator/observability.md with an honest
"Tracing — OTLP surface available, instrumentation pending"
section that:
- Documents the env var + the standard OTEL_* env vars the SDK
honors (OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME, etc.).
- Explains the OTLP/HTTP transport choice (vs gRPC) per the
rationale in internal/observability/otel.go's header.
- Pins what the current release DOES (surface + lazy connect +
graceful shutdown) vs DOES NOT (per-handler / per-DB /
per-connector spans).
- Notes the no-op-shutdown contract so operators can defer
unconditionally.
- Cross-references the existing request_id correlation + per-
issuer Prometheus histogram as the interim correlation surface.
- Repoints the "future work" tracker from the old "v3 item"
framing to WORKSPACE-ROADMAP.md §2 (Phase 4 in the path-b
build plan).
Verified locally: `bash scripts/ci-guards/G-3-env-docs-drift.sh`
exits 0 ("G-3 env-docs-drift: clean").
Observability — what certctl emits, what it doesn't, and what survives a restart
Last reviewed: 2026-05-13
Use this when:
- You're sizing certctl's observability surface against your existing metrics + tracing + logging stack and want to know exactly what drops in cleanly and what gaps you'll need to bridge.
- You're investigating a "weird metric" or planning a Grafana dashboard and need the canonical list of what's exposed.
- You're running multi-replica or restarting frequently and need to understand which counters reset.
certctl's observability posture is deliberately minimal-but-honest: ship the surfaces an operator actually needs to wire into a Prometheus + Grafana + Loki stack, and don't make claims the implementation can't back. This document is the canonical statement of what's emitted, what's deferred, and why.
Metrics — what's emitted
certctl exposes metrics through two endpoints on the control plane:
| Endpoint | Content-Type | Audience |
|---|---|---|
| GET /api/v1/metrics | application/json | Dashboards that prefer JSON, ad-hoc curl |
| GET /api/v1/metrics/prometheus | text/plain; version=0.0.4; charset=utf-8 (Prometheus exposition) | Prometheus, Grafana Agent, Datadog Agent, Victoria Metrics, any OpenMetrics-compatible scraper |
The Prometheus endpoint emits standard # HELP / # TYPE / metric
lines following the conventions at
prometheus.io/docs/instrumenting/exposition_formats.
Metric names are lowercase, snake_case, and prefixed with certctl_.
The implementation is at
internal/api/handler/metrics.go.
What's covered
Run the endpoint against a live deployment for the authoritative list (it expands as the service ships more metrics). At time of writing the exposition includes:
- Certificate-inventory gauges: certctl_certificate_total, certctl_certificate_active, certctl_certificate_expiring_soon, certctl_certificate_expired, certctl_certificate_revoked.
- Per-issuer-type issuance histograms: certctl_issuance_duration_seconds{issuer_type=…} (the 2026-05-01 issuer-coverage audit closure #4; this is the load-bearing metric for per-issuer SLOs).
- Server uptime: certctl_uptime_seconds.
Prometheus library vs hand-rolled exposition (acquisition diligence)
certctl writes Prometheus exposition format with fmt.Fprintf from
the metrics handler, not via the github.com/prometheus/client_golang
library. This is intentional for v2.x:
- The metric surface is shallow (gauges + a handful of histograms with static labels). The client library's value is on the registration + thread-safe accumulation side, neither of which is load-bearing for the current surface.
- The exposition output is pinned to the spec version explicitly (version=0.0.4) and is unit-tested against expected output at internal/api/handler/stats_handler_test.go.
- Swapping in client_golang is a mechanical migration when the metric surface grows (per-connector counters + RED-method histograms on every handler are the natural next surface), but it has no operator-visible behavior change today.
The migration is on the WORKSPACE-ROADMAP.md as a v3 item. If you're an acquirer reading this: the question to ask is "does the metric surface meet our SLO needs today" — not "is the right library under the hood." If the answer to the first question is yes, the second is a refactor, not a feature gap.
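To make the hand-rolled approach concrete, here is a minimal sketch of writing gauges in the text exposition format with fmt.Fprintf. It is not the code in internal/api/handler/metrics.go: the gauge type, the writeGauges and render helpers, and the help text are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// gauge is a hypothetical in-memory view of one gauge sample.
type gauge struct {
	Name  string
	Help  string
	Value float64
}

// writeGauges renders each gauge as a "# HELP" line, a "# TYPE" line,
// and one sample line, which is the shape the text exposition format
// (version 0.0.4) expects. This sketch assumes help strings contain no
// backslashes or newlines; the format requires escaping those.
func writeGauges(w io.Writer, gauges []gauge) error {
	for _, g := range gauges {
		_, err := fmt.Fprintf(w, "# HELP %s %s\n# TYPE %s gauge\n%s %g\n",
			g.Name, g.Help, g.Name, g.Name, g.Value)
		if err != nil {
			return err
		}
	}
	return nil
}

// render returns the exposition as a string, which is convenient for
// the kind of expected-output unit test described above.
func render(gauges []gauge) string {
	var b strings.Builder
	_ = writeGauges(&b, gauges) // writes to a strings.Builder do not fail
	return b.String()
}
```

Because the output is plain text, a golden-string comparison is enough to pin the format, which is part of why the client library is not load-bearing for a surface this small.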
Tracing — OTLP surface available, instrumentation pending
Sprint 6 ACQ DEPL-006 closure (2026-05-16) stood up the OTel tracer-provider surface. Operators with an OTel collector can opt in via:
CERTCTL_OTEL_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel-collector.example.com:4318
When CERTCTL_OTEL_ENABLED is true, cmd/server/main.go calls
internal/observability.Init which:
- Constructs an OTLP/HTTP exporter (chosen over OTLP/gRPC to keep the dependency surface narrow; see the internal/observability/otel.go header for the transport-choice rationale).
- Registers a real sdktrace.TracerProvider as the otel global.
- Honors the standard OTel env vars (OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_HEADERS, OTEL_EXPORTER_OTLP_INSECURE, OTEL_SERVICE_NAME, etc.; OTEL_SERVICE_NAME overrides the default certctl-server).
- Defers a graceful shutdown that flushes the in-flight batcher.
What this does not ship yet:
- No per-handler / per-DB / per-connector span instrumentation in the certctl code base. The OTel SDK emits the spans it generates internally (process resource attributes, eventual stdlib HTTP spans), but certctl-domain spans (issuance, renewal, deployment, agent enrollment) are a v2.3 roadmap follow-up.
- No tracing-correlated metric exemplars in the Prometheus histograms above. Those still ship the per-issuer latency signal without per-request fan-out.
- No backwards-compat shim: operators who never set CERTCTL_OTEL_ENABLED (the default) see zero behavior change. The init returns a no-op shutdown so the deferred call is safe to invoke unconditionally.
When this matters today:
- Operators wiring up a v3 instrumentation effort have the OTel surface in place; they only need to add tracer.Start(ctx, "…") call sites in the handler/service code.
- Operators evaluating certctl for acquisition / due-diligence see an opt-in OTel surface in the current release rather than a "v3 roadmap item", a useful signal for buyer credibility per the acquisition-thesis framing in WORKSPACE-ROADMAP.md §3.
Existing correlation surfaces stay in place until span coverage ships:
- Structured logs include a request_id you can correlate across the server log stream. See internal/api/middleware/request_id.go.
- The Prometheus histogram certctl_issuance_duration_seconds{issuer_type=…} carries the same per-issuer latency signal a trace span would, just without the per-request fan-out.
Per-handler / per-query / per-connector span instrumentation is tracked in WORKSPACE-ROADMAP.md under §2 (NHI / Agent Identity, Phase 4 in the path-b build plan).
Logging
certctl emits structured JSON logs to stdout via the stdlib
log/slog package. Every line carries time, level, msg, and —
where relevant — request_id, actor_id, and a contextual subject
(certificate_id, issuer_id, agent_id, etc.).
Log level is controlled by CERTCTL_LOG_LEVEL (debug / info /
warn / error); defaults to info. There is no in-process log
ingest — operators are expected to collect from container stdout
into their existing log pipeline (Loki, CloudWatch Logs, Datadog,
ELK, Splunk, etc.).
No log line contains private-key material, bearer tokens, OIDC
client secrets, or session cookies. The break-glass login path
explicitly scrubs the password before it reaches the audit subsystem
(see docs/operator/auth-threat-model.md §
"Break-glass token leak").
Rate-limit behavior — configurable backend (memory or postgres)
The sliding-window-log rate limiters used across certctl's
authenticated-but-shared-credential code paths (break-glass login,
OCSP per-IP, cert-export per-actor, EST per-principal, EST
failed-basic source-IP) carry a configurable backend. The
operator picks between two implementations via
CERTCTL_RATE_LIMIT_BACKEND:
| Value | When to use |
|---|---|
| memory | Default. Single-replica deploys; sketchpad / dev. |
| postgres | HA deploys (server.replicas > 1). Cross-replica-consistent. |
Phase 13 Sprint 13.2/13.3 (architecture diligence audit ARCH-M1
closure) removed the prior single-process limitation: when the
operator opts into postgres, all replicas share the same
rate_limit_buckets table (migration 000046) and per-key access is
arbitrated via SELECT FOR UPDATE row locks. A 3-replica cluster
hitting one rate-limited endpoint concurrently sees exactly the
configured cap succeed across the cluster, not 3× the cap as the
old per-process backend would have allowed.
Operator decision tree
Single replica (server.replicas = 1, the helm chart default)?
└─ Use CERTCTL_RATE_LIMIT_BACKEND=memory (the default; no action
required). Bucket lookups stay in-process; zero DB round-trips
on the hot path.
Two or more replicas?
└─ Use CERTCTL_RATE_LIMIT_BACKEND=postgres. Two extra DB round-trips
per Allow call (BEGIN ... SELECT FOR UPDATE ... UPDATE ... COMMIT);
acceptable on the gated hot path. The Sprint 13.2 multi-replica
integration test pins exactly-cap enforcement across N replicas
as the closure proof.
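The decision tree comes down to a small piece of arithmetic, sketched here. The chooseBackend and effectiveCap helpers are hypothetical, and treating an unrecognized backend value as an error is an assumption.

```go
package main

import "fmt"

// chooseBackend validates a CERTCTL_RATE_LIMIT_BACKEND value.
// Empty input yields the documented default (memory). Assumption:
// unrecognized values are rejected rather than ignored.
func chooseBackend(raw string) (string, error) {
	switch raw {
	case "", "memory":
		return "memory", nil
	case "postgres":
		return "postgres", nil
	default:
		return "", fmt.Errorf("unknown rate-limit backend %q", raw)
	}
}

// effectiveCap returns how many requests per window can succeed across
// the whole cluster. With the memory backend every replica enforces the
// cap on its own, so the cluster-wide cap scales with replica count;
// with the postgres backend all replicas share one bucket per key.
func effectiveCap(backend string, perLimiterCap, replicas int) int {
	if backend == "postgres" {
		return perLimiterCap
	}
	return perLimiterCap * replicas
}
```

For example, a cap of 5 on 3 replicas gives effectiveCap("memory", 5, 3) == 15 but effectiveCap("postgres", 5, 3) == 5, assuming traffic is spread across all replicas.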
Inventory
| Limiter | Implementation | Window | Cap |
|---|---|---|---|
| Break-glass login (per source-IP) | internal/api/handler/auth_breakglass.go | 60s | 5 attempts |
| OCSP query (per source-IP) | internal/api/handler/certificates.go | 60s | configurable (CERTCTL_OCSP_RATE_LIMIT_PER_IP_MIN) |
| Cert export (per actor) | internal/api/handler/export.go | 1h | configurable (CERTCTL_CERT_EXPORT_RATE_LIMIT_PER_ACTOR_HR) |
| EST per-principal CSR enrollment | internal/api/handler/est.go | 24h | configurable (per-profile RateLimitPerPrincipal24h) |
| EST HTTP-Basic source-IP failed-auth | internal/api/handler/est.go | 60m | 10 attempts |
| SCEP/Intune per-device challenge | internal/scep/intune/ | 60s | configurable (*_PER_MINUTE) |
| ACME per-account orders / key-change / challenge-respond | internal/service/acme.go | 1h | configurable |
The CERTCTL_RATE_LIMIT_BACKEND selector applies to the first five
(the cmd/server-wired limiters). The SCEP/Intune wrapper + the ACME
per-account limiter ride their own internal accounting today; both
are tracked as follow-ups in WORKSPACE-ROADMAP.md.
Backend internals
Both backends share the algorithm: sliding-window log + per-key bucket + prune-on-Allow.
Memory backend (memory) — per-process map keyed by bucket key;
mutex-guarded; package-level LRU cap prevents unbounded growth under
adversarial key cardinality (default 100,000 keys per limiter
instance; oldest-by-newest-timestamp evicted under pressure).
Implemented at internal/ratelimit/sliding_window.go.
Postgres backend (postgres) — same algorithm against the
rate_limit_buckets table:
CREATE TABLE rate_limit_buckets (
bucket_key TEXT PRIMARY KEY,
timestamps TIMESTAMPTZ[] NOT NULL DEFAULT '{}',
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Allow(key, now) opens a transaction, ensures the row exists
(INSERT ... ON CONFLICT DO NOTHING), acquires the row lock
(SELECT ... FOR UPDATE), prunes timestamps older than now-window,
compares the post-prune count against maxN, conditionally appends
now, persists, and commits. The row lock is what arbitrates across
replicas: replicas A and B firing simultaneous Allow("k") never
race because Postgres serializes the per-key row update across the
cluster. Implemented at
internal/ratelimit/postgres_sliding_window.go.
Janitor sweep (postgres backend only)
The scheduler runs a rate_limit_buckets janitor every
CERTCTL_RATE_LIMIT_JANITOR_INTERVAL (default 5m, minimum 1m). The
sweep deletes rows whose updated_at is older than the longest
configured window any limiter uses (24h today, matching the EST
per-principal limiter). Idempotent; repeated sweeps find zero rows.
The memory backend's prune-on-Allow path keeps buckets short-lived
without a separate sweep, so the loop is a no-op when
backend=memory.
Falsifiable closure proof
The Phase 13 Sprint 13.2 integration test
internal/integration/ratelimit_multi_replica_test.go
(//go:build integration) fires 100 concurrent Allow("test-key")
calls round-robined across 3 independent PostgresSlidingWindowLimiter
instances sharing one Postgres database (cap=10, window=1m) and
asserts exactly 10 succeed + 90 return ErrRateLimited. If the
cross-replica row lock weren't arbitrating, each replica would
independently let through ~3-4 requests, giving 12-15 successes
total. Re-run:
go test -tags=integration -count=1 -run TestRateLimit_MultiReplica \
./internal/integration/...
Helm chart wiring
The helm chart at deploy/helm/certctl/ exposes the backend via
server.rateLimiting.backend (default memory). To opt into the
postgres backend for an HA deploy:
helm upgrade --install certctl deploy/helm/certctl \
--set server.replicas=3 \
--set server.rateLimiting.backend=postgres \
--set server.rateLimiting.janitorInterval=5m
server.replicas > 1 without flipping backend to postgres works
fine — the limits stay per-process — but the operator gets a 2× /
3× / Nx effective cap depending on replica count. The chart does NOT
auto-flip on replicas > 1 because some HA deploys deliberately want
per-process limits (sticky-session ingress + tight per-replica caps
to detect bot traffic at the edge before it hits the application).
Where these numbers live
The configurable caps are exposed as CERTCTL_*_PER_MINUTE /
CERTCTL_ACME_*_PER_HOUR env vars — see the
security posture doc for the operator-facing
configuration surface. The hard-coded ones (break-glass 5/min) are
intentionally non-configurable as a defense-in-depth measure; the
auth subsystem owns that policy decision.
Performance harness scope
The load-test harness at deploy/test/loadtest/
covers the API-tier hot paths (issuance acceptance + cert list). It
does NOT load-test issuer-connector round-trips (you'd be load-
testing someone else's API), full multi-RTT ACME enrollment flows,
bulk-revoke / bulk-renew admin paths, or scheduler concurrency under
bulk renewal. Each exclusion is justified in
deploy/test/loadtest/README.md
under "What it explicitly does NOT measure." If your evaluation
requires a benchmark on one of those exclusions, the right next step
is a follow-up scenario in that directory.
The per-component benchmarks ship in-tree as Go Benchmark*
functions:
- internal/auth/session/bench_test.go: session signing + validation steady state and cold-process timing.
- internal/auth/oidc/bench_test.go: OIDC verify steady state.
- internal/auth/oidc/bench_keycloak_test.go: OIDC cold-cache timing (gated //go:build integration).
Authoritative benchmark numbers + threshold contracts:
docs/operator/auth-benchmarks.md (auth
subsystem) and docs/operator/performance-baselines.md
(general API tier).
Related reading
- docs/operator/security.md: the broader hardening posture; this document is its observability subset.
- docs/operator/performance-baselines.md: operator-runnable benchmarks against the API tier
- docs/operator/auth-benchmarks.md: session + OIDC validation timings + threshold contracts
- deploy/test/loadtest/README.md: k6 load-test harness scope + threshold contract
- docs/operator/runbooks/postgres-backup.md: operator-run backup recipe (separate file because it's a procedural runbook, not an observability claim)