mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 15:32:02 +00:00
57b539c378
Closes acquisition-diligence Bundle 12 — Observability, DR,
Operations Receipts, And Performance Proof. Source IDs: D5, D6, D8,
T9, finding 7, OPS-H1, OPS-M1, OPS-M2, LOW-7.
Two new operator-facing references; both non-audit-framed per the
Bundle 5 doc-placement policy.
docs/operator/observability.md — single canonical statement of what
certctl emits, what it doesn't, and what survives a restart:
- Metrics surface: both /api/v1/metrics (JSON) and
/api/v1/metrics/prometheus (text exposition v0.0.4); inventory of
certctl_certificate_* gauges + certctl_issuance_duration_seconds
per-issuer-type histogram + certctl_uptime_seconds.
- Prometheus library vs hand-rolled exposition: explicit scope
statement — hand-rolled fmt.Fprintf is intentional for v2.x given
the shallow metric surface; client_golang migration tracked as
v3 item (closes OPS-M1).
- Tracing: explicit deferral — no OTel SDK setup, OTel packages
are indirect-only in go.mod, no spans, no OTLP exporter; tracked
as v3 item; in the meantime structured logs carry request_id and
certctl_issuance_duration_seconds carries the per-issuer latency
signal (closes OPS-M2).
- Logging: structured JSON via log/slog; CERTCTL_LOG_LEVEL control;
no key material / bearer tokens / session cookies in log lines.
- Rate-limit semantics under restarts + replicas: per-process,
in-memory, reset-on-restart, NOT shared across replicas; full
inventory of the 5 limiter call sites (break-glass login,
SCEP/Intune per-device, EST per-principal CSR, EST HTTP-Basic
source-IP, ACME per-account); multi-replica + sticky-session
implications; database-backed sliding window deferred to v3
(closes D8).
- Performance harness scope: cross-references the explicit
'What it explicitly does NOT measure' list in
deploy/test/loadtest/README.md (closes LOW-7 + finding 7).
docs/operator/runbooks/postgres-backup.md — operator-runnable
backup procedure:
- Inventory of what to back up (DB + operator-managed file
material that lives outside the DB: CA keys, RA keys, OCSP
responder keys, trust bundles).
- Logical backup recipe with docker-compose + Kubernetes variants,
integrity verification step, off-host storage step.
- Physical / PITR recipe pointing at pgbackrest / wal-g
(certctl ships nothing here — standard PostgreSQL DBA work).
- Three sample automation paths (in-cluster Postgres → S3 CronJob,
managed Postgres PITR, self-hosted VM systemd timer + restic).
- Quarterly restore-dry-run procedure.
- Helm CronJob template deliberately not shipped — three
documented reasons (deployment topology / secret-management
integration / off-host storage all vary by operator) plus
roadmap entry for shipping a starter template when a real
operator asks for one (closes D6 + OPS-H1).
Both new docs wired into docs/README.md Operator + Runbooks tables.
D5 (ServiceMonitor) and T9 (canonical k6 load-test) were already
shipped in Bundle 3 (deploy/helm/certctl/templates/servicemonitor.yaml)
and in deploy/test/loadtest/ + .github/workflows/loadtest.yml
respectively; this bundle doesn't touch them — it just records the
closure in the audit HTML.
Verified:
bash scripts/ci-guards/G-3-env-docs-drift.sh # PASS
bash scripts/ci-guards/doc-rot-detector.sh # PASS
All 35 scripts/ci-guards/*.sh green.
215 lines
11 KiB
Markdown
# Observability — what certctl emits, what it doesn't, and what survives a restart

> Last reviewed: 2026-05-13

Use this when:

- You're sizing certctl's observability surface against your existing
  metrics + tracing + logging stack and want to know exactly what
  drops in cleanly and what gaps you'll need to bridge.
- You're investigating a "weird metric" or planning a Grafana
  dashboard and need the canonical list of what's exposed.
- You're running multi-replica or restarting frequently and need to
  understand which counters reset.

certctl's observability posture is deliberately minimal-but-honest:
ship the surfaces an operator actually needs to wire into a
Prometheus + Grafana + Loki stack, and don't make claims the
implementation can't back. This document is the canonical statement
of what's emitted, what's deferred, and why.

## Metrics — what's emitted

certctl exposes metrics through two endpoints on the control plane:

| Endpoint | Content-Type | Audience |
|---|---|---|
| `GET /api/v1/metrics` | `application/json` | Dashboards that prefer JSON, ad-hoc curl |
| `GET /api/v1/metrics/prometheus` | `text/plain; version=0.0.4; charset=utf-8` (Prometheus text exposition) | Prometheus, Grafana Agent, Datadog Agent, VictoriaMetrics, any scraper that accepts the Prometheus text format |

The Prometheus endpoint emits standard `# HELP` / `# TYPE` / metric
lines following the conventions at
[prometheus.io/docs/instrumenting/exposition_formats](https://prometheus.io/docs/instrumenting/exposition_formats/).
Metric names are lowercase, snake_case, and prefixed with `certctl_`.

The implementation is at
[`internal/api/handler/metrics.go`](../../internal/api/handler/metrics.go).

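To make the wire format concrete, here is a minimal sketch of reading the text exposition into a map of sample values. It is not certctl code: `parseExposition` is a hypothetical helper, the payload values are made up, and in practice the text would come from `GET /api/v1/metrics/prometheus` on your own deployment.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseExposition reads Prometheus text exposition and returns sample
// values keyed by the full series name (labels included). Comment lines
// (# HELP / # TYPE) are skipped and optional timestamps are ignored.
func parseExposition(text string) (map[string]float64, error) {
	samples := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The series name ends at the last closing brace when labels are
		// present, otherwise just before the first space.
		end := strings.LastIndex(line, "}") + 1
		if end == 0 {
			end = strings.Index(line, " ")
		}
		if end <= 0 {
			return nil, fmt.Errorf("malformed sample line: %q", line)
		}
		fields := strings.Fields(line[end:])
		if len(fields) == 0 {
			return nil, fmt.Errorf("missing value: %q", line)
		}
		value, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in %q: %w", line, err)
		}
		samples[line[:end]] = value
	}
	return samples, sc.Err()
}

func main() {
	// Illustrative payload only; the help text and values are invented.
	const body = `# HELP certctl_uptime_seconds Illustrative help text.
# TYPE certctl_uptime_seconds gauge
certctl_uptime_seconds 3600
certctl_certificate_total 128
certctl_certificate_expiring_soon 4
`
	samples, err := parseExposition(body)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("expiring soon:", samples["certctl_certificate_expiring_soon"])
}
```

A real scraper (Prometheus itself, or an agent) should be preferred over a hand-written parser; the sketch only shows what a consumer sees on the wire.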
### What's covered

Run the endpoint against a live deployment for the authoritative list
(it expands as the service ships more metrics). At time of writing the
exposition includes:

- Certificate-inventory gauges: `certctl_certificate_total`,
  `certctl_certificate_active`, `certctl_certificate_expiring_soon`,
  `certctl_certificate_expired`, `certctl_certificate_revoked`.
- Per-issuer-type issuance histograms:
  `certctl_issuance_duration_seconds{issuer_type=…}` (the 2026-05-01
  issuer-coverage audit closure #4; this is the load-bearing metric
  for per-issuer SLOs).
- Server uptime: `certctl_uptime_seconds`.

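Because the issuance histogram is the metric per-issuer SLOs would be built on, it may help to see how a latency quantile is estimated from cumulative buckets. The sketch below uses roughly the same linear interpolation that PromQL's `histogram_quantile` applies; the `bucket` type, the `quantile` function, and the bucket boundaries are all hypothetical, not certctl's actual bucket layout.

```go
package main

import (
	"fmt"
	"math"
)

// bucket is one cumulative histogram bucket as exposed in the text format.
type bucket struct {
	upperBound float64 // the "le" label value; +Inf for the last bucket
	count      float64 // cumulative observations <= upperBound
}

// quantile estimates the q-quantile (0 <= q <= 1) from buckets sorted by
// ascending upperBound, interpolating linearly inside the matching bucket.
func quantile(q float64, buckets []bucket) float64 {
	if q < 0 || q > 1 || len(buckets) == 0 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	for i, b := range buckets {
		if b.count < rank {
			continue
		}
		if math.IsInf(b.upperBound, 1) {
			// The quantile falls in the open-ended bucket: report the
			// highest finite bound instead of infinity.
			if i == 0 {
				return math.NaN()
			}
			return buckets[i-1].upperBound
		}
		lowerBound, lowerCount := 0.0, 0.0
		if i > 0 {
			lowerBound, lowerCount = buckets[i-1].upperBound, buckets[i-1].count
		}
		if b.count == lowerCount {
			return lowerBound
		}
		return lowerBound + (b.upperBound-lowerBound)*(rank-lowerCount)/(b.count-lowerCount)
	}
	return math.NaN()
}

func main() {
	// Invented bucket data for one issuer_type.
	buckets := []bucket{
		{0.5, 10}, {1, 90}, {2.5, 100}, {math.Inf(1), 100},
	}
	fmt.Printf("p50=%.2fs p95=%.2fs\n", quantile(0.5, buckets), quantile(0.95, buckets))
}
```

The estimate is only as precise as the bucket boundaries allow, which is worth keeping in mind when setting SLO thresholds close to a boundary.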
### Prometheus library vs hand-rolled exposition (acquisition diligence)

certctl writes Prometheus exposition format with `fmt.Fprintf` from
the metrics handler, not via the `github.com/prometheus/client_golang`
library. This is intentional for v2.x:

- The metric surface is shallow (gauges + a handful of histograms with
  static labels). The client library's value is on the registration +
  thread-safe accumulation side, neither of which is load-bearing for
  the current surface.
- The exposition output is pinned to the spec version explicitly
  (`version=0.0.4`) and is unit-tested against expected output at
  `internal/api/handler/stats_handler_test.go`.
- Swapping in `client_golang` is a mechanical migration when the
  metric surface grows (per-connector counters + RED-method histograms
  on every handler are the natural next surface), but it would change
  no operator-visible behavior today.

The migration is tracked in
[WORKSPACE-ROADMAP.md](../../WORKSPACE-ROADMAP.md) as a v3 item. If
you're an acquirer reading this: the question to ask is "does the
metric surface meet our SLO needs today", not "is the right library
under the hood." If the answer to the first question is yes, the
second is a refactor, not a feature gap.

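For readers unfamiliar with what "hand-rolled exposition" means in practice, the following is a minimal sketch of emitting one gauge with `fmt.Fprintf`. It illustrates the technique only; `writeGauge`, `renderGauge`, and the help text are hypothetical and are not copied from `internal/api/handler/metrics.go`.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// writeGauge emits a single gauge in Prometheus text exposition format:
// a # HELP line, a # TYPE line, then the sample itself.
func writeGauge(w io.Writer, name, help string, value float64) error {
	_, err := fmt.Fprintf(w, "# HELP %s %s\n# TYPE %s gauge\n%s %g\n",
		name, help, name, name, value)
	return err
}

// renderGauge returns the same output as a string, which makes it easy
// to compare against an expected exposition in a unit test.
func renderGauge(name, help string, value float64) string {
	var sb strings.Builder
	_ = writeGauge(&sb, name, help, value) // strings.Builder writes never fail
	return sb.String()
}

func main() {
	// A real handler would also set the Content-Type header to
	// "text/plain; version=0.0.4; charset=utf-8" before writing.
	if err := writeGauge(os.Stdout, "certctl_uptime_seconds", "Illustrative help text.", 3600); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
	}
}
```

With only gauges and a few static-label histograms, this style stays small; the trade-off is that escaping, registration, and concurrent accumulation become the author's responsibility rather than the library's.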
## Tracing — explicitly not yet shipped

certctl does **not** ship distributed tracing instrumentation today:

- No OpenTelemetry SDK setup in `cmd/server/main.go`.
- No OTLP exporter wired into outbound calls (issuer connectors,
  agent enrollment, etc.).
- The `go.opentelemetry.io/otel` packages that appear in
  [`go.mod`](../../go.mod) are indirect-only: they're transitive
  dependencies of `coreos/go-oidc` and similar.

Stated plainly: there is no in-process tracing surface to monitor,
correlate, or sample. If your environment requires end-to-end traces
across the certctl control plane + agents + issuer backends, this is
a gap that would have to be closed on the certctl side as part of a
v3 work item. Until then:

- Structured logs include a `request_id` you can correlate across
  the server log stream. See
  [`internal/api/middleware/request_id.go`](../../internal/api/middleware/request_id.go).
- The Prometheus histogram
  `certctl_issuance_duration_seconds{issuer_type=…}` carries the
  same per-issuer latency signal a trace span would, just without
  the per-request fan-out.

OpenTelemetry instrumentation is tracked in
[WORKSPACE-ROADMAP.md](../../WORKSPACE-ROADMAP.md) as a v3 item.

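Until tracing exists, `request_id` correlation is the main way to follow one request through the log stream. A minimal sketch of that correlation follows, assuming JSON log lines with `request_id` and `msg` fields; the helper name and the sample log lines are hypothetical.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// filterByRequestID returns the "msg" of every JSON log line whose
// request_id equals id, in input order. Lines that are not valid JSON
// are skipped rather than treated as errors.
func filterByRequestID(logs, id string) []string {
	var msgs []string
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		var entry struct {
			RequestID string `json:"request_id"`
			Msg       string `json:"msg"`
		}
		if err := json.Unmarshal(sc.Bytes(), &entry); err != nil {
			continue
		}
		if entry.RequestID == id {
			msgs = append(msgs, entry.Msg)
		}
	}
	return msgs
}

func main() {
	// Invented log lines in the shape the Logging section describes.
	const logs = `{"time":"2026-05-13T10:00:00Z","level":"INFO","msg":"issuance accepted","request_id":"req-1"}
{"time":"2026-05-13T10:00:00Z","level":"INFO","msg":"list certificates","request_id":"req-2"}
{"time":"2026-05-13T10:00:01Z","level":"INFO","msg":"issuance completed","request_id":"req-1"}
`
	for _, msg := range filterByRequestID(logs, "req-1") {
		fmt.Println(msg)
	}
}
```

In a real pipeline the same filter is usually a one-line query in Loki, CloudWatch Logs, or whichever log backend you already run.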
## Logging

certctl emits structured JSON logs to stdout via the stdlib
`log/slog` package. Every line carries `time`, `level`, `msg`, and,
where relevant, `request_id`, `actor_id`, and a contextual subject
(`certificate_id`, `issuer_id`, `agent_id`, etc.).

Log level is controlled by `CERTCTL_LOG_LEVEL` (`debug` / `info` /
`warn` / `error`); the default is `info`. There is no in-process log
ingest: operators are expected to collect from container stdout
into their existing log pipeline (Loki, CloudWatch Logs, Datadog,
ELK, Splunk, etc.).

No log line contains private-key material, bearer tokens, OIDC
client secrets, or session cookies. The break-glass login path
explicitly scrubs the password before it reaches the audit subsystem
(see [`docs/operator/auth-threat-model.md`](auth-threat-model.md) §
"Break-glass token leak").

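As an illustration of the logging setup described above, here is a minimal sketch of a `log/slog` JSON logger whose level comes from `CERTCTL_LOG_LEVEL`. The helper names are hypothetical, and falling back to `info` for unrecognized values is an assumption of this sketch, not confirmed certctl behavior.

```go
package main

import (
	"io"
	"log/slog"
	"os"
)

// parseLevel maps a CERTCTL_LOG_LEVEL-style string (debug / info / warn /
// error, case-insensitive) to a slog.Level. Empty or unrecognized values
// fall back to info; that fallback is an assumption of this sketch.
func parseLevel(s string) slog.Level {
	var lvl slog.Level
	if err := lvl.UnmarshalText([]byte(s)); err != nil {
		return slog.LevelInfo
	}
	return lvl
}

// newLogger builds a JSON logger that writes to w at the given level.
func newLogger(w io.Writer, level string) *slog.Logger {
	handler := slog.NewJSONHandler(w, &slog.HandlerOptions{Level: parseLevel(level)})
	return slog.New(handler)
}

func main() {
	logger := newLogger(os.Stdout, os.Getenv("CERTCTL_LOG_LEVEL"))
	// Key/value pairs become top-level JSON fields next to time, level, msg.
	logger.Info("certificate issued", "request_id", "req-123", "certificate_id", "cert-42")
	logger.Debug("only emitted when the level is debug")
}
```

Running it with `CERTCTL_LOG_LEVEL=debug` emits both lines; with the variable unset, only the `Info` line appears.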
## Rate-limit behavior under restarts and replicas

Where rate limits exist, they are **per-process, in-memory,
reset-on-restart, and not shared across replicas**. This matters for
multi-replica deployments and for any compliance posture that asks
"what limits apply globally vs per-pod."

### Inventory

| Limiter | Implemented in | Window | Cap | Survives restart? | Shared across replicas? |
|---|---|---|---|---|---|
| Break-glass login (per source-IP) | `internal/api/handler/auth_breakglass.go` | 60s | 5 attempts | No | No |
| SCEP/Intune per-device challenge | `internal/scep/intune/` | 60s | configurable (`*_PER_MINUTE`) | No | No |
| EST per-principal CSR enrollment | `internal/est/` | 60s | configurable | No | No |
| EST HTTP-Basic source-IP failed-auth | `internal/est/` | 60s | configurable | No | No |
| ACME per-account orders / key-change / challenge-respond | `internal/service/acme.go` | 1h | configurable | No | No |

All five use the shared `internal/ratelimit/sliding_window.go`
primitive. Buckets live in a single per-process map guarded by a
mutex; the package-level cap prevents unbounded growth under
adversarial key cardinality (default 100,000 keys; under pressure,
the key whose newest timestamp is oldest is evicted first).

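To show why these limiters reset on restart and are invisible to other replicas, here is a minimal sketch of a mutex-guarded, in-memory sliding window. It is not the code in `internal/ratelimit/sliding_window.go`: the type and method names are hypothetical, timestamps are plain Unix seconds to keep the example deterministic, and the key-cardinality cap with eviction is omitted.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// slidingWindow is a per-process limiter. All state lives in one map
// guarded by a mutex, so it is lost on restart and never shared.
type slidingWindow struct {
	mu        sync.Mutex
	windowSec int64
	limit     int
	hits      map[string][]int64 // key -> accepted timestamps (Unix seconds)
}

func newSlidingWindow(windowSec int64, limit int) *slidingWindow {
	return &slidingWindow{windowSec: windowSec, limit: limit, hits: make(map[string][]int64)}
}

// allow reports whether key may proceed at nowSec, recording the hit if
// so. Rejected attempts are not recorded.
func (s *slidingWindow) allow(key string, nowSec int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()

	// Drop timestamps that have slid out of the window (now-window, now].
	cutoff := nowSec - s.windowSec
	kept := s.hits[key][:0]
	for _, t := range s.hits[key] {
		if t > cutoff {
			kept = append(kept, t)
		}
	}
	if len(kept) >= s.limit {
		s.hits[key] = kept
		return false
	}
	s.hits[key] = append(kept, nowSec)
	return true
}

func main() {
	limiter := newSlidingWindow(60, 5) // 5 attempts per 60s, like break-glass login
	now := time.Now().Unix()
	for i := 1; i <= 6; i++ {
		fmt.Printf("attempt %d allowed: %v\n", i, limiter.allow("203.0.113.7", now))
	}
	// A "restart" is just a fresh, empty map: the same key is admitted again.
	restarted := newSlidingWindow(60, 5)
	fmt.Println("after restart allowed:", restarted.allow("203.0.113.7", now))
}
```

Nothing in the struct is persisted or replicated, which is exactly what the "Survives restart?" and "Shared across replicas?" columns record.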
### Implications for multi-replica deployments

- **Effective per-replica cap is the documented cap.** Each replica
  enforces the cap independently, so a 2-replica deployment can let
  through up to 2× the per-key window cap in aggregate.
- **Restart resets the bucket.** A `kubectl rollout restart` empties
  the in-memory windows on every replica. An attacker who notices
  this could in principle re-issue burst attempts after every roll;
  the threat model accepts this because rollouts are operator-driven
  and the relevant endpoints already require credentials.
- **No cross-replica fan-out.** Rate-limit decisions on replica A
  are not visible to replica B. Sticky-session ingress routing (with
  `service.spec.sessionAffinity: ClientIP` on Kubernetes or the
  equivalent on your load balancer) pins each source IP to one
  replica, which tightens the effective cap for that client to the
  documented per-key cap instead of that cap multiplied across
  whichever pods its requests happened to land on.

If your threat model requires globally-enforced rate limits across
replicas, the implementation surface is roughly: swap the per-process
map for a database-backed sliding window (or a Redis-backed equivalent
if you already run Redis). This is tracked in
[WORKSPACE-ROADMAP.md](../../WORKSPACE-ROADMAP.md) as a v3 item;
nothing in the certctl threat model today requires it.

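The multi-replica arithmetic above can be checked with a small simulation. This is a deliberately simplified model, not certctl code: each hypothetical `replica` keeps one counter for a single key inside a single window, and requests are either spread round-robin or pinned to one replica as `ClientIP` affinity would do.

```go
package main

import "fmt"

// replica models one process with its own in-memory counter for a single
// key inside a single window.
type replica struct {
	limit int
	seen  int
}

func (r *replica) allow() bool {
	if r.seen >= r.limit {
		return false
	}
	r.seen++
	return true
}

// admitted sends attempts requests for one key to n independent replicas
// and returns how many were let through in aggregate. With sticky set,
// every request lands on the same replica; otherwise they are spread
// round-robin. n must be at least 1.
func admitted(n, limit, attempts int, sticky bool) int {
	replicas := make([]replica, n)
	for i := range replicas {
		replicas[i].limit = limit
	}
	ok := 0
	for i := 0; i < attempts; i++ {
		target := i % n
		if sticky {
			target = 0
		}
		if replicas[target].allow() {
			ok++
		}
	}
	return ok
}

func main() {
	fmt.Println("2 replicas, round-robin:", admitted(2, 5, 20, false))
	fmt.Println("2 replicas, sticky:     ", admitted(2, 5, 20, true))
}
```

Real load balancers are rarely perfectly round-robin, so the aggregate in practice falls somewhere between the per-replica cap and N times that cap.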
### Where these numbers live

The configurable caps are exposed as `CERTCTL_*_PER_MINUTE` /
`CERTCTL_ACME_*_PER_HOUR` env vars; see the
[security posture](security.md) doc for the operator-facing
configuration surface. The hard-coded ones (break-glass 5/min) are
intentionally non-configurable as a defense-in-depth measure; the
auth subsystem owns that policy decision.

## Performance harness scope

The load-test harness at [`deploy/test/loadtest/`](../../deploy/test/loadtest/)
covers the API-tier hot paths (issuance acceptance + cert list). It
does NOT load-test issuer-connector round-trips (you'd be
load-testing someone else's API), full multi-RTT ACME enrollment flows,
bulk-revoke / bulk-renew admin paths, or scheduler concurrency under
bulk renewal. Each exclusion is justified in
[`deploy/test/loadtest/README.md`](../../deploy/test/loadtest/README.md)
under "What it explicitly does NOT measure." If your evaluation
requires a benchmark on one of those exclusions, the right next step
is a follow-up scenario in that directory.

The per-component benchmarks ship in-tree as Go `Benchmark*`
functions:
- `internal/auth/session/bench_test.go`: session signing + validation
  steady state and cold-process timing.
- `internal/auth/oidc/bench_test.go`: OIDC verify steady state.
- `internal/auth/oidc/bench_keycloak_test.go`: OIDC cold-cache timing
  (gated `//go:build integration`).

Authoritative benchmark numbers + threshold contracts:
[`docs/operator/auth-benchmarks.md`](auth-benchmarks.md) (auth
subsystem) and [`docs/operator/performance-baselines.md`](performance-baselines.md)
(general API tier).

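For readers who have not written one, this is the general shape of a Go `Benchmark*` function. The workload is a stand-in (HMAC-SHA256 over a short payload) and is not certctl's session-signing code; `sign` and `BenchmarkSign` are hypothetical names. The sketch uses `testing.Benchmark` so it can run as a plain program; in a repository the function would live in a `*_test.go` file and be run with `go test -bench`.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
	"testing"
)

// sign is a stand-in workload: HMAC-SHA256 over a payload.
func sign(key, payload []byte) []byte {
	mac := hmac.New(sha256.New, key)
	mac.Write(payload)
	return mac.Sum(nil)
}

// BenchmarkSign has the standard benchmark signature. The loop body runs
// b.N times; the testing package picks b.N to get a stable timing.
func BenchmarkSign(b *testing.B) {
	key, payload := []byte("example-key"), []byte("example-payload")
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sign(key, payload)
	}
}

func main() {
	// testing.Benchmark runs a benchmark function outside `go test`.
	res := testing.Benchmark(BenchmarkSign)
	fmt.Printf("%d iterations, %d ns/op, %d allocs/op\n",
		res.N, res.NsPerOp(), res.AllocsPerOp())
}
```

Numbers from a sketch like this are only meaningful relative to each other on the same machine; the threshold contracts in the linked benchmark docs remain the authoritative reference.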
## Related reading

- [`docs/operator/security.md`](security.md): the broader hardening
  posture; this document is its observability subset.
- [`docs/operator/performance-baselines.md`](performance-baselines.md): operator-runnable benchmarks against the API tier.
- [`docs/operator/auth-benchmarks.md`](auth-benchmarks.md): session +
  OIDC validation timings + threshold contracts.
- [`deploy/test/loadtest/README.md`](../../deploy/test/loadtest/README.md): k6 load-test harness scope + threshold contract.
- [`docs/operator/runbooks/postgres-backup.md`](runbooks/postgres-backup.md): operator-run backup recipe (separate file because it's a procedural runbook, not an observability claim).