Closes acquisition-diligence Bundle 12 — Observability, DR,
Operations Receipts, And Performance Proof. Source IDs: D5, D6, D8,
T9, finding 7, OPS-H1, OPS-M1, OPS-M2, LOW-7.
Two new operator-facing references; both non-audit-framed per the
Bundle 5 doc-placement policy.
docs/operator/observability.md — single canonical statement of what
certctl emits, what it doesn't, and what survives a restart:
- Metrics surface: both /api/v1/metrics (JSON) and
/api/v1/metrics/prometheus (text exposition v0.0.4); inventory of
certctl_certificate_* gauges + certctl_issuance_duration_seconds
per-issuer-type histogram + certctl_uptime_seconds.
- Prometheus library vs hand-rolled exposition: explicit scope
statement — hand-rolled fmt.Fprintf is intentional for v2.x given
the shallow metric surface; client_golang migration tracked as
v3 item (closes OPS-M1).
- Tracing: explicit deferral — no OTel SDK setup, OTel packages
are indirect-only in go.mod, no spans, no OTLP exporter; tracked
as v3 item; in the meantime structured logs carry request_id and
certctl_issuance_duration_seconds carries the per-issuer latency
signal (closes OPS-M2).
- Logging: structured JSON via log/slog; CERTCTL_LOG_LEVEL control;
no key material / bearer tokens / session cookies in log lines.
- Rate-limit semantics under restarts + replicas: per-process,
in-memory, reset-on-restart, NOT shared across replicas; full
inventory of the 5 limiter call sites (break-glass login,
SCEP/Intune per-device, EST per-principal CSR, EST HTTP-Basic
source-IP, ACME per-account); multi-replica + sticky-session
implications; database-backed sliding window deferred to v3
(closes D8).
- Performance harness scope: cross-references the explicit
'What it explicitly does NOT measure' list in
deploy/test/loadtest/README.md (closes LOW-7 + finding 7).
docs/operator/runbooks/postgres-backup.md — operator-runnable
backup procedure:
- Inventory of what to back up (DB + operator-managed file
material that lives outside the DB: CA keys, RA keys, OCSP
responder keys, trust bundles).
- Logical backup recipe with docker-compose + Kubernetes variants,
integrity verification step, off-host storage step.
- Physical / PITR recipe pointing at pgbackrest / wal-g
(certctl ships nothing here — standard PostgreSQL DBA work).
- Three sample automation paths (in-cluster Postgres → S3 CronJob,
managed Postgres PITR, self-hosted VM systemd timer + restic).
- Quarterly restore-dry-run procedure.
- Helm CronJob template deliberately not shipped — three
documented reasons (deployment topology / secret-management
integration / off-host storage all vary by operator) plus
roadmap entry for shipping a starter template when a real
operator asks for one (closes D6 + OPS-H1).
Both new docs wired into docs/README.md Operator + Runbooks tables.
D5 (ServiceMonitor) and T9 (canonical k6 load-test) were already
shipped in Bundle 3 (deploy/helm/certctl/templates/servicemonitor.yaml)
and in deploy/test/loadtest/ + .github/workflows/loadtest.yml
respectively; this bundle doesn't touch them — it just records the
closure in the audit HTML.
Verified:
bash scripts/ci-guards/G-3-env-docs-drift.sh # PASS
bash scripts/ci-guards/doc-rot-detector.sh # PASS
All 35 scripts/ci-guards/*.sh green.
11 KiB
Observability — what certctl emits, what it doesn't, and what survives a restart
Last reviewed: 2026-05-13
Use this when:
- You're sizing certctl's observability surface against your existing metrics + tracing + logging stack and want to know exactly what drops in cleanly and what gaps you'll need to bridge.
- You're investigating a "weird metric" or planning a Grafana dashboard and need the canonical list of what's exposed.
- You're running multi-replica or restarting frequently and need to understand which counters reset.
certctl's observability posture is deliberately minimal-but-honest: ship the surfaces an operator actually needs to wire into a Prometheus
- Grafana + Loki stack, and don't make claims the implementation can't back. This document is the canonical statement of what's emitted, what's deferred, and why.
Metrics — what's emitted
certctl exposes metrics through two endpoints on the control plane:
| Endpoint | Content-Type | Audience |
|---|---|---|
GET /api/v1/metrics |
application/json |
Dashboards that prefer JSON, ad-hoc curl |
GET /api/v1/metrics/prometheus |
text/plain; version=0.0.4; charset=utf-8 (Prometheus exposition) |
Prometheus, Grafana Agent, Datadog Agent, Victoria Metrics, any OpenMetrics-compatible scraper |
The Prometheus endpoint emits standard # HELP / # TYPE / metric
lines following the conventions at
prometheus.io/docs/instrumenting/exposition_formats.
Metric names are lowercase, snake_case, and prefixed with certctl_.
The implementation is at
internal/api/handler/metrics.go.
What's covered
Run the endpoint against a live deployment for the authoritative list (it expands as the service ships more metrics). At time of writing the exposition includes:
- Certificate-inventory gauges:
certctl_certificate_total,certctl_certificate_active,certctl_certificate_expiring_soon,certctl_certificate_expired,certctl_certificate_revoked. - Per-issuer-type issuance histograms:
certctl_issuance_duration_seconds{issuer_type=…}(the 2026-05-01 issuer-coverage audit closure #4 — this is the load-bearing metric for per-issuer SLOs). - Server uptime:
certctl_uptime_seconds.
Prometheus library vs hand-rolled exposition (acquisition diligence)
certctl writes Prometheus exposition format with fmt.Fprintf from
the metrics handler, not via the github.com/prometheus/client_golang
library. This is intentional for v2.x:
- The metric surface is shallow (gauges + a handful of histograms with static labels). The client library's value is on the registration + thread-safe accumulation side, neither of which is load-bearing for the current surface.
- The exposition output is pinned to the spec version explicitly
(
version=0.0.4) and is unit-tested against expected output atinternal/api/handler/stats_handler_test.go. - Swapping in
client_golangis a mechanical migration when the metric surface grows (per-connector counters + RED-method histograms on every handler are the natural next surface), but it has no operator-visible behavior change today.
The migration is on the WORKSPACE-ROADMAP.md as a v3 item. If you're an acquirer reading this: the question to ask is "does the metric surface meet our SLO needs today" — not "is the right library under the hood." If the answer to the first question is yes, the second is a refactor, not a feature gap.
Tracing — explicitly not yet shipped
certctl does not ship distributed tracing instrumentation today:
- No OpenTelemetry SDK setup in
cmd/server/main.go. - No OTLP exporter wired into outbound calls (issuer connectors, agent enrollment, etc.).
- The
go.opentelemetry.io/otelpackages that appear ingo.modare indirect-only — they're transitive dependencies ofcoreos/go-oidcand similar.
This is honest: there is no in-process tracing surface to monitor, correlate, or sample. If your environment requires end-to-end traces across the certctl control plane + agents + issuer backends, this is a gap you would close on the certctl side as part of a v3 work item. Until then:
- Structured logs include a
request_idyou can correlate across the server log stream. Seeinternal/api/middleware/request_id.go. - The Prometheus histogram
certctl_issuance_duration_seconds{issuer_type=…}carries the same per-issuer latency signal a trace span would, just without the per-request fan-out.
OpenTelemetry instrumentation is tracked in WORKSPACE-ROADMAP.md as a v3 item.
Logging
certctl emits structured JSON logs to stdout via the stdlib
log/slog package. Every line carries time, level, msg, and —
where relevant — request_id, actor_id, and a contextual subject
(certificate_id, issuer_id, agent_id, etc.).
Log level is controlled by CERTCTL_LOG_LEVEL (debug / info /
warn / error); defaults to info. There is no in-process log
ingest — operators are expected to collect from container stdout
into their existing log pipeline (Loki, CloudWatch Logs, Datadog,
ELK, Splunk, etc.).
No log line contains private-key material, bearer tokens, OIDC
client secrets, or session cookies. The break-glass login path
explicitly scrubs the password before it reaches the audit subsystem
(see docs/operator/auth-threat-model.md §
"Break-glass token leak").
Rate-limit behavior under restarts and replicas
Where rate limits exist, they are per-process, in-memory, reset-on-restart, and not shared across replicas. This matters for multi-replica deployments and for any compliance posture that asks "what limits apply globally vs per-pod."
Inventory
| Limiter | Scope | Window | Cap | Survives restart? | Shared across replicas? |
|---|---|---|---|---|---|
| Break-glass login (per source-IP) | internal/api/handler/auth_breakglass.go |
60s | 5 attempts | No | No |
| SCEP/Intune per-device challenge | internal/scep/intune/ |
60s | configurable (*_PER_MINUTE) |
No | No |
| EST per-principal CSR enrollment | internal/est/ |
60s | configurable | No | No |
| EST HTTP-Basic source-IP failed-auth | internal/est/ |
60s | configurable | No | No |
| ACME per-account orders / key-change / challenge-respond | internal/service/acme.go |
1h | configurable | No | No |
All five use the shared internal/ratelimit/sliding_window.go
primitive. Buckets live in a single per-process map guarded by a
mutex; the package-level cap prevents unbounded growth under
adversarial key cardinality (default 100,000 keys; oldest-by-newest-
timestamp evicted under pressure).
Implications for multi-replica deployments
- Effective per-replica cap is the documented cap. A 2-replica deployment lets through up to 2× the per-key window cap before either replica rejects.
- Restart resets the bucket. A
kubectl rollout restartempties the in-memory windows on every replica. An attacker who notices this could in principle re-issue burst attempts after every roll; the threat model accepts this because rollouts are operator-driven and the relevant endpoints already require credentials. - No cross-replica fan-out. Rate-limit decisions on replica A
are not visible to replica B. Sticky-session ingress routing (with
service.spec.sessionAffinity: ClientIPon Kubernetes or the equivalent on your load balancer) tightens the effective cap to per-replica + per-source-IP rather than per-replica + per-source-IP for whichever pod the request happened to land on.
If your threat model requires globally-enforced rate limits across replicas, the implementation surface is roughly: swap the per-process map for a database-backed sliding window (or a Redis-backed equivalent if you already run Redis). This is on the WORKSPACE-ROADMAP.md as a v3 item; nothing in the certctl threat model today requires it.
Where these numbers live
The configurable caps are exposed as CERTCTL_*_PER_MINUTE /
CERTCTL_ACME_*_PER_HOUR env vars — see the
security posture doc for the operator-facing
configuration surface. The hard-coded ones (break-glass 5/min) are
intentionally non-configurable as a defense-in-depth measure; the
auth subsystem owns that policy decision.
Performance harness scope
The load-test harness at deploy/test/loadtest/
covers the API-tier hot paths (issuance acceptance + cert list). It
does NOT load-test issuer-connector round-trips (you'd be load-
testing someone else's API), full multi-RTT ACME enrollment flows,
bulk-revoke / bulk-renew admin paths, or scheduler concurrency under
bulk renewal. Each exclusion is justified in
deploy/test/loadtest/README.md
under "What it explicitly does NOT measure." If your evaluation
requires a benchmark on one of those exclusions, the right next step
is a follow-up scenario in that directory.
The per-component benchmarks ship in-tree as Go Benchmark*
functions:
internal/auth/session/bench_test.go— session signing + validation steady state and cold-process timing.internal/auth/oidc/bench_test.go— OIDC verify steady state.internal/auth/oidc/bench_keycloak_test.go— OIDC cold-cache timing (gated//go:build integration).
Authoritative benchmark numbers + threshold contracts:
docs/operator/auth-benchmarks.md (auth
subsystem) and docs/operator/performance-baselines.md
(general API tier).
Related reading
docs/operator/security.md— the broader hardening posture; this document is its observability subset.docs/operator/performance-baselines.md— operator-runnable benchmarks against the API tierdocs/operator/auth-benchmarks.md— session- OIDC validation timings + threshold contracts
deploy/test/loadtest/README.md— k6 load-test harness scope + threshold contractdocs/operator/runbooks/postgres-backup.md— operator-run backup recipe (separate file because it's a procedural runbook, not an observability claim)