docs(b12): observability reference + Postgres backup runbook

Closes acquisition-diligence Bundle 12 (Observability, DR, Operations Receipts, And Performance Proof). Source IDs: D5, D6, D8, T9, finding 7, OPS-H1, OPS-M1, OPS-M2, LOW-7.

Two new operator-facing references; both non-audit-framed per the Bundle 5 doc-placement policy.

docs/operator/observability.md: single canonical statement of what certctl emits, what it doesn't, and what survives a restart:
- Metrics surface: both /api/v1/metrics (JSON) and /api/v1/metrics/prometheus (text exposition v0.0.4); inventory of certctl_certificate_* gauges + certctl_issuance_duration_seconds per-issuer-type histogram + certctl_uptime_seconds.
- Prometheus library vs hand-rolled exposition: explicit scope statement. Hand-rolled fmt.Fprintf is intentional for v2.x given the shallow metric surface; client_golang migration tracked as v3 item (closes OPS-M1).
- Tracing: explicit deferral. No OTel SDK setup, OTel packages are indirect-only in go.mod, no spans, no OTLP exporter; tracked as v3 item; in the meantime structured logs carry request_id and certctl_issuance_duration_seconds carries the per-issuer latency signal (closes OPS-M2).
- Logging: structured JSON via log/slog; CERTCTL_LOG_LEVEL control; no key material / bearer tokens / session cookies in log lines.
- Rate-limit semantics under restarts + replicas: per-process, in-memory, reset-on-restart, NOT shared across replicas; full inventory of the 5 limiter call sites (break-glass login, SCEP/Intune per-device, EST per-principal CSR, EST HTTP-Basic source-IP, ACME per-account); multi-replica + sticky-session implications; database-backed sliding window deferred to v3 (closes D8).
- Performance harness scope: cross-references the explicit 'What it explicitly does NOT measure' list in deploy/test/loadtest/README.md (closes LOW-7 + finding 7).

docs/operator/runbooks/postgres-backup.md: operator-runnable backup procedure:
- Inventory of what to back up (DB + operator-managed file material that lives outside the DB: CA keys, RA keys, OCSP responder keys, trust bundles).
- Logical backup recipe with docker-compose + Kubernetes variants, integrity verification step, off-host storage step.
- Physical / PITR recipe pointing at pgbackrest / wal-g (certctl ships nothing here; this is standard PostgreSQL DBA work).
- Three sample automation paths (in-cluster Postgres → S3 CronJob, managed Postgres PITR, self-hosted VM systemd timer + restic).
- Quarterly restore-dry-run procedure.
- Helm CronJob template deliberately not shipped: three documented reasons (deployment topology / secret-management integration / off-host storage all vary by operator) plus roadmap entry for shipping a starter template when a real operator asks for one (closes D6 + OPS-H1).

Both new docs wired into docs/README.md Operator + Runbooks tables.

D5 (ServiceMonitor) and T9 (canonical k6 load-test) were already shipped in Bundle 3 (deploy/helm/certctl/templates/servicemonitor.yaml) and in deploy/test/loadtest/ + .github/workflows/loadtest.yml respectively; this bundle doesn't touch them, it just records the closure in the audit HTML.

Verified:
  bash scripts/ci-guards/G-3-env-docs-drift.sh   # PASS
  bash scripts/ci-guards/doc-rot-detector.sh     # PASS
All 35 scripts/ci-guards/*.sh green.
2026-07-26 15:08:12 +00:00 · 2026-05-13 02:09:11 +00:00
parent 072e2af198
commit 57b539c378
3 changed files with 385 additions and 0 deletions
@@ -0,0 +1,214 @@
+# Observability — what certctl emits, what it doesn't, and what survives a restart
+
+> Last reviewed: 2026-05-13
+
+Use this when:
+- You're sizing certctl's observability surface against your existing
+  metrics + tracing + logging stack and want to know exactly what
+  drops in cleanly and what gaps you'll need to bridge.
+- You're investigating a "weird metric" or planning a Grafana
+  dashboard and need the canonical list of what's exposed.
+- You're running multi-replica or restarting frequently and need to
+  understand which counters reset.
+
+certctl's observability posture is deliberately minimal-but-honest:
+ship the surfaces an operator actually needs to wire into a
+Prometheus + Grafana + Loki stack, and don't make claims the
+implementation can't back. This document is the canonical statement
+of what's emitted, what's deferred, and why.
+
+## Metrics — what's emitted
+
+certctl exposes metrics through two endpoints on the control plane:
+
+| Endpoint                          | Content-Type                                                      | Audience                         |
+|---|---|---|
+| `GET /api/v1/metrics`             | `application/json`                                                | Dashboards that prefer JSON, ad-hoc curl |
+| `GET /api/v1/metrics/prometheus`  | `text/plain; version=0.0.4; charset=utf-8` (Prometheus exposition) | Prometheus, Grafana Agent, Datadog Agent, VictoriaMetrics, any scraper that accepts the Prometheus text format |
+
+The Prometheus endpoint emits standard `# HELP` / `# TYPE` / metric
+lines following the conventions at
+[prometheus.io/docs/instrumenting/exposition_formats](https://prometheus.io/docs/instrumenting/exposition_formats/).
+Metric names are lowercase, snake_case, and prefixed with `certctl_`.
+
+The implementation is at
+[`internal/api/handler/metrics.go`](../../internal/api/handler/metrics.go).
+
+### What's covered
+
+Run the endpoint against a live deployment for the authoritative list
+(it expands as the service ships more metrics). At time of writing the
+exposition includes:
+
+- Certificate-inventory gauges: `certctl_certificate_total`,
+  `certctl_certificate_active`, `certctl_certificate_expiring_soon`,
+  `certctl_certificate_expired`, `certctl_certificate_revoked`.
+- Per-issuer-type issuance histograms:
+  `certctl_issuance_duration_seconds{issuer_type=…}` (added as
+  closure #4 of the 2026-05-01 issuer-coverage audit; this is the
+  load-bearing metric for per-issuer SLOs).
+- Server uptime: `certctl_uptime_seconds`.
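A minimal sketch for pulling a single value out of the Prometheus endpoint from a shell, for example in a smoke check. The helper name `metric_value`, the `CERTCTL_URL` and `CERTCTL_TOKEN` variables, and the bearer-token header in the commented usage line are assumptions for illustration, not part of certctl. It only handles unlabeled samples such as the gauges listed above.

```shell
# metric_value NAME
# Reads Prometheus text exposition on stdin and prints the value of the
# unlabeled sample NAME. Exits non-zero when the sample is absent.
# (Hypothetical helper; labeled series such as histograms are not handled.)
metric_value() {
  awk -v name="$1" '
    $1 == name { print $2; found = 1 }   # HELP and TYPE lines start with "#"
    END { exit !found }
  '
}

# Usage against a live deployment (URL and auth header are assumptions):
#   curl -fsS -H "Authorization: Bearer $CERTCTL_TOKEN" \
#     "$CERTCTL_URL/api/v1/metrics/prometheus" \
#     | metric_value certctl_certificate_expiring_soon
```

Because the helper exits non-zero when the metric is missing, it can double as a cheap check that a dashboard's source series still exists after an upgrade.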
+
+### Prometheus library vs hand-rolled exposition (acquisition diligence)
+
+certctl writes Prometheus exposition format with `fmt.Fprintf` from
+the metrics handler, not via the `github.com/prometheus/client_golang`
+library. This is intentional for v2.x:
+
+- The metric surface is shallow (gauges + a handful of histograms with
+  static labels). The client library's value is on the registration +
+  thread-safe accumulation side, neither of which is load-bearing for
+  the current surface.
+- The exposition output is pinned to the spec version explicitly
+  (`version=0.0.4`) and is unit-tested against expected output at
+  `internal/api/handler/stats_handler_test.go`.
+- Swapping in `client_golang` is a mechanical migration when the
+  metric surface grows (per-connector counters + RED-method histograms
+  on every handler are the natural next surface), but it has no
+  operator-visible behavior change today.
+
+The migration is on the
+[WORKSPACE-ROADMAP.md](../../WORKSPACE-ROADMAP.md) as a v3 item. If
+you're an acquirer reading this: the question to ask is "does the
+metric surface meet our SLO needs today" — not "is the right library
+under the hood." If the answer to the first question is yes, the
+second is a refactor, not a feature gap.
+
+## Tracing — explicitly not yet shipped
+
+certctl does **not** ship distributed tracing instrumentation today:
+
+- No OpenTelemetry SDK setup in `cmd/server/main.go`.
+- No OTLP exporter wired into outbound calls (issuer connectors,
+  agent enrollment, etc.).
+- The `go.opentelemetry.io/otel` packages that appear in
+  [`go.mod`](../../go.mod) are indirect-only — they're transitive
+  dependencies of `coreos/go-oidc` and similar.
+
+Put plainly, there is no in-process tracing surface to monitor,
+correlate, or sample. If your environment requires end-to-end traces
+across the certctl control plane + agents + issuer backends, this is
+a gap that has to be closed inside certctl itself, and that work is
+scoped as a v3 item. Until then:
+
+- Structured logs include a `request_id` you can correlate across
+  the server log stream. See
+  [`internal/api/middleware/request_id.go`](../../internal/api/middleware/request_id.go).
+- The Prometheus histogram
+  `certctl_issuance_duration_seconds{issuer_type=…}` carries the
+  same per-issuer latency signal a trace span would, just without
+  the per-request fan-out.
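A minimal sketch of the log-correlation workaround described above. It assumes `request_id` is written as a top-level string attribute in the compact JSON that `log/slog`'s JSON handler produces; the helper name `logs_for_request` and the commented `docker compose` service name are hypothetical.

```shell
# logs_for_request REQUEST_ID
# Filters JSON log lines on stdin down to those carrying the given
# request_id. Uses a fixed-string match, so no JSON parser is required.
logs_for_request() {
  grep -F "\"request_id\":\"$1\""
}

# Usage (the service name "certctl-server" is an assumption):
#   docker compose logs --no-log-prefix certctl-server | logs_for_request "$REQUEST_ID"
```

This gives the ordered list of log lines for one request on one replica; it does not reconstruct timing across services the way a trace would.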
+
+OpenTelemetry instrumentation is tracked in
+[WORKSPACE-ROADMAP.md](../../WORKSPACE-ROADMAP.md) as a v3 item.
+
+## Logging
+
+certctl emits structured JSON logs to stdout via the stdlib
+`log/slog` package. Every line carries `time`, `level`, `msg`, and —
+where relevant — `request_id`, `actor_id`, and a contextual subject
+(`certificate_id`, `issuer_id`, `agent_id`, etc.).
+
+Log level is controlled by `CERTCTL_LOG_LEVEL` (`debug` / `info` /
+`warn` / `error`); defaults to `info`. There is no in-process log
+ingest — operators are expected to collect from container stdout
+into their existing log pipeline (Loki, CloudWatch Logs, Datadog,
+ELK, Splunk, etc.).
+
+No log line contains private-key material, bearer tokens, OIDC
+client secrets, or session cookies. The break-glass login path
+explicitly scrubs the password before it reaches the audit subsystem
+(see [`docs/operator/auth-threat-model.md`](auth-threat-model.md) §
+"Break-glass token leak").
+
+## Rate-limit behavior under restarts and replicas
+
+Where rate limits exist, they are **per-process, in-memory,
+reset-on-restart, and not shared across replicas**. This matters for
+multi-replica deployments and for any compliance posture that asks
+"what limits apply globally vs per-pod."
+
+### Inventory
+
+| Limiter                                              | Code location        | Window | Cap                            | Survives restart? | Shared across replicas? |
+|---|---|---|---|---|---|
+| Break-glass login (per source-IP)                    | `internal/api/handler/auth_breakglass.go` | 60s   | 5 attempts                     | No                | No                      |
+| SCEP/Intune per-device challenge                     | `internal/scep/intune/`                   | 60s   | configurable (`*_PER_MINUTE`)  | No                | No                      |
+| EST per-principal CSR enrollment                     | `internal/est/`                           | 60s   | configurable                   | No                | No                      |
+| EST HTTP-Basic source-IP failed-auth                 | `internal/est/`                           | 60s   | configurable                   | No                | No                      |
+| ACME per-account orders / key-change / challenge-respond | `internal/service/acme.go`            | 1h    | configurable                   | No                | No                      |
+
+All five use the shared `internal/ratelimit/sliding_window.go`
+primitive. Buckets live in a single per-process map guarded by a
+mutex; a package-level cap on the number of keys (default 100,000)
+prevents unbounded growth under adversarial key cardinality. Under
+pressure, the key whose newest timestamp is oldest is evicted first.
+
+### Implications for multi-replica deployments
+
+- **The documented cap applies per replica.** Without sticky
+  routing, a 2-replica deployment can let through up to 2× the
+  per-key window cap before both replicas reject; in general, N
+  replicas admit up to N× the cap.
+- **Restart resets the bucket.** A `kubectl rollout restart` empties
+  the in-memory windows on every replica. An attacker who notices
+  this could in principle re-issue burst attempts after every roll;
+  the threat model accepts this because rollouts are operator-driven
+  and the relevant endpoints already require credentials.
+- **No cross-replica fan-out.** Rate-limit decisions on replica A
+  are not visible to replica B. Sticky-session ingress routing (with
+  `service.spec.sessionAffinity: ClientIP` on Kubernetes or the
+  equivalent on your load balancer) pins each source IP to one
+  replica, which tightens the effective cap for that IP to a single
+  replica's cap instead of the combined cap of every pod its
+  requests happened to land on.
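The multi-replica arithmetic can be sketched as a one-line calculation. The helper name `effective_cap` is hypothetical; the figures use the break-glass limit from the inventory table (5 attempts per 60s window).

```shell
# effective_cap REPLICAS PER_REPLICA_CAP
# Upper bound on requests one key can get through per window when each
# replica enforces its own in-memory limit and requests are spread
# across all replicas (no sticky routing).
effective_cap() {
  echo $(( $1 * $2 ))
}

effective_cap 2 5   # break-glass login on 2 replicas: prints 10
effective_cap 1 5   # single replica, or sticky routing per source IP: prints 5
```

This is a worst-case bound: it is only reached when the load balancer spreads one key's requests evenly enough to fill every replica's window.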
+
+If your threat model requires globally-enforced rate limits across
+replicas, the implementation surface is roughly: swap the per-process
+map for a database-backed sliding window (or a Redis-backed equivalent
+if you already run Redis). This is on the
+[WORKSPACE-ROADMAP.md](../../WORKSPACE-ROADMAP.md) as a v3 item;
+nothing in the certctl threat model today requires it.
+
+### Where these numbers live
+
+The configurable caps are exposed as `CERTCTL_*_PER_MINUTE` /
+`CERTCTL_ACME_*_PER_HOUR` env vars — see the
+[security posture](security.md) doc for the operator-facing
+configuration surface. The hard-coded ones (break-glass 5/min) are
+intentionally non-configurable as a defense-in-depth measure; the
+auth subsystem owns that policy decision.
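To audit which configurable caps a running deployment actually has set, one option is to filter its environment for the two naming patterns above. This is a sketch only: the helper name `ratelimit_settings` is hypothetical, and how you obtain the environment listing (`env` inside the container, a rendered manifest, and so on) depends on your deployment.

```shell
# ratelimit_settings
# Reads KEY=VALUE lines on stdin and prints only the rate-limit caps,
# i.e. names matching CERTCTL_*_PER_MINUTE or CERTCTL_*_PER_HOUR, sorted.
ratelimit_settings() {
  grep -E '^CERTCTL_[A-Z0-9_]+_PER_(MINUTE|HOUR)=' | sort
}

# Usage from inside the server container:
#   env | ratelimit_settings
```

An empty result means no cap is overridden in that environment, so whatever defaults the server applies are in effect.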
+
+## Performance harness scope
+
+The load-test harness at [`deploy/test/loadtest/`](../../deploy/test/loadtest/)
+covers the API-tier hot paths (issuance acceptance + cert list). It
+does NOT load-test issuer-connector round-trips (you'd be load-
+testing someone else's API), full multi-RTT ACME enrollment flows,
+bulk-revoke / bulk-renew admin paths, or scheduler concurrency under
+bulk renewal. Each exclusion is justified in
+[`deploy/test/loadtest/README.md`](../../deploy/test/loadtest/README.md)
+under "What it explicitly does NOT measure." If your evaluation
+requires a benchmark on one of those exclusions, the right next step
+is a follow-up scenario in that directory.
+
+The per-component benchmarks ship in-tree as Go `Benchmark*`
+functions:
+- `internal/auth/session/bench_test.go` — session signing + validation
+  steady state and cold-process timing.
+- `internal/auth/oidc/bench_test.go` — OIDC verify steady state.
+- `internal/auth/oidc/bench_keycloak_test.go` — OIDC cold-cache timing
+  (gated `//go:build integration`).
+
+Authoritative benchmark numbers + threshold contracts:
+[`docs/operator/auth-benchmarks.md`](auth-benchmarks.md) (auth
+subsystem) and [`docs/operator/performance-baselines.md`](performance-baselines.md)
+(general API tier).
+
+## Related reading
+
+- [`docs/operator/security.md`](security.md) — the broader hardening
+  posture; this document is its observability subset.
+- [`docs/operator/performance-baselines.md`](performance-baselines.md) — operator-runnable benchmarks against the API tier
+- [`docs/operator/auth-benchmarks.md`](auth-benchmarks.md) — session
+  + OIDC validation timings + threshold contracts
+- [`deploy/test/loadtest/README.md`](../../deploy/test/loadtest/README.md) — k6 load-test harness scope + threshold contract
+- [`docs/operator/runbooks/postgres-backup.md`](runbooks/postgres-backup.md) — operator-run backup recipe (separate file because it's a procedural runbook, not an observability claim)
@@ -0,0 +1,169 @@
+# Runbook: PostgreSQL backup for certctl
+
+> Last reviewed: 2026-05-13
+
+Use this when:
+- You're setting up a new certctl deployment and need a backup policy
+  before going to production.
+- A buyer or auditor asks "where's the backup automation?" and you need
+  to point at the recommended cadence + procedure.
+- You're rotating the encryption key, swapping CAs, or doing any other
+  destructive maintenance and want a snapshot to roll back to.
+
+certctl does not ship a built-in backup daemon. Postgres is the system
+of record for every piece of certctl state that isn't on the
+operator's filesystem (CA keys, OCSP responder keys, SCEP/EST trust
+bundles — see "Operator-managed (NOT in DB)" in the
+[disaster-recovery runbook](disaster-recovery.md#postgres-restore));
+backing it up is treated as a standard PostgreSQL operations task
+that the operator owns end-to-end with their existing tooling.
+
+This page is the recommended recipe.
+
+## What to back up
+
+| Layer                              | Tool                                                                    | Cadence                  |
+|---|---|---|
+| `certctl` database (the row data)  | `pg_dump` (logical) **or** `pg_basebackup` + WAL archive (physical, PITR) | ≥ daily, retention ≥ 30d |
+| CA cert + key (`CERTCTL_CA_CERT_PATH`, `CERTCTL_CA_KEY_PATH`) | Out-of-band file backup (operator's existing secret-management tool) | On change |
+| SCEP RA cert + key (per profile)   | Out-of-band file backup                                                 | On change                |
+| OCSP responder keys                | Out-of-band file backup (`CERTCTL_OCSP_RESPONDER_KEY_DIR`)              | On change                |
+| Trust-anchor PEM bundles           | Out-of-band file backup                                                 | On change                |
+| Env vars (auth secret, etc.)       | Operator's secret-management tool (Vault, AWS Secrets Manager, etc.)    | On rotation              |
+
+A backup of only the Postgres database without the operator-managed
+file material is **not a complete restore artifact** — see the
+[disaster-recovery runbook's Postgres-restore section](disaster-recovery.md#postgres-restore)
+for the full inventory. The DR runbook owns the restore procedure;
+this page owns the capture procedure.
+
+## Logical backup (recommended for most deployments)
+
+`pg_dump -Fc` produces a portable compressed dump that's easy to
+restore into a fresh Postgres instance at any version ≥ the dump's
+source version. Best for deployments where the DB is small enough
+that a full logical dump fits the backup window (rough rule of thumb:
+under a million `managed_certificates` rows + corresponding history).
+
+### docker-compose
+
+```bash
+# 1. Snapshot. Run from any host that can reach the postgres container.
+TIMESTAMP=$(date -u +%Y%m%dT%H%M%SZ)
+docker compose -f deploy/docker-compose.yml exec -T postgres \
+  pg_dump --format=custom --no-owner --no-acl --dbname=certctl \
+  > "certctl-${TIMESTAMP}.dump"
+
+# 2. Verify integrity (catch transport / truncation bugs early).
+docker run --rm -v "$PWD:/dumps" -w /dumps postgres:16-alpine \
+  pg_restore --list "certctl-${TIMESTAMP}.dump" > /dev/null \
+  && echo "OK: pg_restore --list parses the dump cleanly" \
+  || { echo "CORRUPT DUMP"; exit 1; }
+
+# 3. Move to durable storage (S3, GCS, NFS, encrypted-at-rest blob
+# storage of your choice). DO NOT leave the dump on the certctl host
+# alone — that defeats the purpose of having a backup.
+aws s3 cp "certctl-${TIMESTAMP}.dump" "s3://your-bucket/certctl/"
+```
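One caveat with the snapshot step: a shell redirect creates the output file even when `pg_dump` fails partway, so a truncated file can sit next to good dumps until the verification step runs. A minimal guard, assuming nothing beyond POSIX-style shell tools (the helper name `safe_dump` is hypothetical), writes to a temporary name and renames only on success:

```shell
# safe_dump OUTFILE CMD [ARGS...]
# Runs CMD with stdout captured in OUTFILE.partial, then renames it to
# OUTFILE only if CMD exited 0 and produced a non-empty file. On any
# failure the partial file is removed and the function returns 1.
safe_dump() {
  local out="$1" tmp="$1.partial"
  shift
  if "$@" > "$tmp" && [ -s "$tmp" ]; then
    mv "$tmp" "$out"
  else
    rm -f "$tmp"
    echo "dump failed; $out not written" >&2
    return 1
  fi
}

# Usage with the snapshot command from step 1:
#   safe_dump "certctl-${TIMESTAMP}.dump" \
#     docker compose -f deploy/docker-compose.yml exec -T postgres \
#     pg_dump --format=custom --no-owner --no-acl --dbname=certctl
```

The guard complements the `pg_restore --list` check rather than replacing it: it catches a failed or empty dump, while the list step catches a dump that completed but does not parse.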
+
+### Kubernetes (with the bundled Helm chart)
+
+```bash
+# 1. Snapshot via kubectl exec into the postgres StatefulSet pod.
+TIMESTAMP=$(date -u +%Y%m%dT%H%M%SZ)
+NAMESPACE=certctl
+kubectl exec -n "$NAMESPACE" statefulset/postgres -- \
+  pg_dump --format=custom --no-owner --no-acl --dbname=certctl \
+  > "certctl-${TIMESTAMP}.dump"
+
+# 2. Same verification step as above.
+# 3. Same off-host storage step as above.
+```
+
+### Restore (cross-reference)
+
+The restore procedure lives in
+[disaster-recovery.md § Postgres restore](disaster-recovery.md#postgres-restore).
+The key reminders: stop certctl first, restore the DB, run any
+migrations newer than the snapshot, truncate the CRL + OCSP caches,
+then restart.
+
+## Physical / PITR backup (large fleets, RPO < 1h)
+
+Logical dumps have a coarse RPO (the last successful dump). For
+deployments where losing even an hour of cert-issuance history is
+unacceptable, pair Postgres physical backups with continuous WAL
+archiving:
+
+- `pg_basebackup` for the initial seed
+- `archive_command = '<your-WAL-archiver>'` in `postgresql.conf` to
+  ship every WAL segment off the host as it closes
+- `pgbackrest` or `wal-g` for the operational layer (both are
+  battle-tested, support encryption, and integrate cleanly with S3 /
+  GCS / Azure Blob)
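A rough static check that a given `postgresql.conf` has archiving switched on can catch the common mistake of setting `archive_command` while leaving `archive_mode` off. This is a sketch under stated assumptions: the helper name `wal_archiving_configured` is hypothetical, and it inspects a single file only, so settings applied through include files or `ALTER SYSTEM` are not seen.

```shell
# wal_archiving_configured CONF_FILE
# Succeeds when CONF_FILE sets archive_mode to on (or always) and sets a
# non-empty archive_command. Commented-out settings do not count.
wal_archiving_configured() {
  local conf="$1"
  grep -Eq "^[[:space:]]*archive_mode[[:space:]]*=[[:space:]]*'?(on|always)'?" "$conf" \
    && grep -Eq "^[[:space:]]*archive_command[[:space:]]*=[[:space:]]*'[^']+" "$conf"
}

# Usage (the path is an assumption; it varies by distribution and image):
#   wal_archiving_configured /var/lib/postgresql/data/postgresql.conf \
#     && echo "WAL archiving configured" \
#     || echo "WAL archiving NOT configured"
```

On a running server, `SHOW archive_mode;` and `SHOW archive_command;` in `psql` remain the authoritative answer, since they reflect the effective settings.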
+
+certctl ships nothing in this layer — it's standard PostgreSQL DBA
+work, and shipping a bespoke recipe would just be a worse version of
+what `pgbackrest` already does. The
+[pgbackrest configuration guide](https://pgbackrest.org/configuration.html)
+is the authoritative reference.
+
+## Automation paths
+
+This is the gap an acquisition reviewer typically wants to see filled.
+certctl ships no backup CronJob template in the Helm chart — the
+operator owns this layer because:
+
+1. The right tool depends on the deployment topology (in-cluster
+   Postgres vs. managed Postgres vs. self-hosted on a VM).
+2. The right secret-management integration depends on the operator's
+   existing stack (Vault, AWS Secrets Manager, GCP Secret Manager,
+   sealed-secrets, External Secrets).
+3. The right storage backend depends on the operator's existing
+   off-host blob storage.
+
+A bundled CronJob would be a half-answer for any operator with an
+established backup posture, and would have to be torn out before
+production. Three sample recipes that cover the common cases:
+
+- **In-cluster Postgres → S3:** a CronJob running an alpine image with
+  `aws-cli` + the `pg_dump` command above, output piped to
+  `aws s3 cp`. Use a Cosign-signed image if your supply-chain policy
+  requires it.
+- **Managed Postgres (AWS RDS / GCP Cloud SQL / Azure DB):** rely on
+  the cloud provider's built-in PITR backup; configure retention
+  ≥ 30 days; the certctl deployment surface is the connection string
+  alone.
+- **Self-hosted VM:** systemd timer + `pg_dump` + `restic` (or
+  `borgbackup`) to encrypted off-host storage.
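For the self-hosted VM path, the local staging directory also needs pruning so dumps do not accumulate on the database host. A minimal sketch, assuming dumps follow the `certctl-<timestamp>.dump` naming used earlier on this page and that `find` supports `-maxdepth` and `-delete` (GNU and BSD both do); the helper name `prune_old_dumps` is hypothetical:

```shell
# prune_old_dumps DIR [DAYS]
# Deletes certctl-*.dump files directly inside DIR whose last
# modification is more than DAYS days old (default 30), printing each
# path it removes.
prune_old_dumps() {
  local dir="$1" days="${2:-30}"
  find "$dir" -maxdepth 1 -type f -name 'certctl-*.dump' -mtime "+$days" -print -delete
}

# Usage from a systemd timer or cron job, after the off-host copy succeeds:
#   prune_old_dumps /var/backups/certctl 30
```

This only manages local copies. Retention of the off-host copies is better handled by the storage backend itself, for example a bucket lifecycle rule or the backup tool's own retention policy.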
+
+Tracked in [WORKSPACE-ROADMAP.md](../../../WORKSPACE-ROADMAP.md) as a
+post-v2.1.0 nice-to-have: an opt-in Helm CronJob template for the
+in-cluster-Postgres-to-S3 case as a starter. The right time to ship
+it is when a real operator asks for it; speculatively shipping it
+without that signal would just produce a template every deployment
+ends up rewriting.
+
+## Verification — what to dry-run quarterly
+
+A backup you've never restored is a backup you don't have. Add this
+to your quarterly on-call rotation:
+
+1. Pick the most recent dump from the previous quarter.
+2. Stand up a throwaway Postgres instance (Docker, kind, anything).
+3. `pg_restore -d certctl <the dump>`.
+4. Bring up a certctl-server container pointed at the throwaway DB
+   (`CERTCTL_DATABASE_URL=postgres://certctl:...@throwaway/...`).
+5. Confirm `/api/v1/version` returns 200, `/api/v1/certificates`
+   lists the expected rows, and the scheduler logs show no
+   migration-version mismatch.
+6. Tear down. Note the timing in your DR registry.
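Step 1 can be scripted when dumps keep the `certctl-<timestamp>.dump` naming from the logical-backup recipe, because `date -u +%Y%m%dT%H%M%SZ` timestamps sort chronologically as plain strings. A minimal sketch (the helper name `latest_dump_before` is hypothetical, and it assumes the dumps have been synced to a local directory):

```shell
# latest_dump_before DIR CUTOFF
# Prints the name of the newest certctl-<timestamp>.dump in DIR whose
# timestamp is strictly earlier than CUTOFF (format YYYYMMDDTHHMMSSZ).
# Prints nothing when no dump qualifies.
latest_dump_before() {
  local dir="$1" cutoff="$2"
  ls "$dir" \
    | awk -v cutoff="certctl-${cutoff}.dump" '/^certctl-.*\.dump$/ && $0 < cutoff' \
    | sort \
    | tail -n 1
}

# Usage: newest dump taken before the start of the current quarter.
#   latest_dump_before /var/backups/certctl 20260401T000000Z
```

Passing the first instant of the current quarter as the cutoff yields the last dump of the previous quarter, or an older one if that quarter produced none, which is itself worth noticing.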
+
+The [disaster-recovery runbook](disaster-recovery.md) covers what to
+do when this dry-run reveals a gap.
+
+## Related reading
+
+- [`docs/operator/runbooks/disaster-recovery.md`](disaster-recovery.md) — the restore companion
+- [`docs/operator/secret-custody.md`](../secret-custody.md) — what
+  the operator-managed file material (CA keys, RA keys, trust
+  anchors) contains, why it lives outside the DB, and what it costs
+  to lose