deploy(helm): close Phase 4 — chart surface + DR + ops runbooks

Phase 4 of the certctl architecture diligence remediation closure.
Seven findings, all in deploy/helm/certctl/.

DEPL-H2 (High): ship deploy/helm/certctl/templates/backup-cronjob.yaml

  Operator opt-in via backup.enabled=true. Default OFF. CronJob runs

    pg_dump --format=custom --no-owner --no-acl --dbname=certctl

  matching the canonical shape in
  docs/operator/runbooks/postgres-backup.md (so manual and automated
  dumps are byte-identical). Sink: PVC (default) OR S3 via aws-cli.
  Documented as in-cluster-Postgres only; managed DB deployments rely
  on their provider's PITR.

DEPL-M1 (Med): Helm pre-install/pre-upgrade migration hook

  deploy/helm/certctl/templates/migration-job.yaml runs
  `certctl-server --migrate-only` before the server Deployment rolls.
  The --migrate-only flag (new in cmd/server/main.go) is a hermetic
  schema-mutation pass: load config, open DB pool, run RunMigrations +
  RunSeed, exit 0. No HTTP listener, no scheduler, no signing setup.

  Server's boot-time RunMigrations call is now gated on
  CERTCTL_MIGRATIONS_VIA_HOOK: when set true, the server skips the
  boot path (the hook owns the work). Default still runs at boot, so
  Compose / VM / bare-metal deploys are unchanged.
  migrations.viaHook: false in values.yaml (off by default).

DEPL-M4 (Med): explicit Postgres StatefulSet strategy fields

  deploy/helm/certctl/templates/postgres-statefulset.yaml adds:

    spec.updateStrategy.type: OnDelete
    spec.podManagementPolicy: OrderedReady

  Operator-controlled Postgres upgrades (the OnDelete strategy means a
  chart template tweak no longer triggers an immediate Postgres
  restart). OrderedReady aligns with the standard
  Postgres-on-Kubernetes pattern for any future HA work.

DEPL-M5 (Med): per-fleet-size resource ladder documentation

  deploy/helm/certctl/values.yaml: extended comments next to
  server.resources + agent.resources documenting:

    "≤ 500 certs / 100 agents"  → defaults are validated
    "5K certs / 1K agents"      → starter suggestions, TBD Phase 8
    "50K certs / 10K agents"    → starter suggestions, TBD Phase 8

  Numbers for the small-fleet case derive from the measured baselines
  in docs/operator/performance-baselines.md (50ms p50, < 3s for
  1000-cert inventory walk, etc.). Larger fleet numbers explicitly
  marked TBD pending Phase 8 load-test runs; operators tune
  empirically until then.

DEPL-L1 (Low): Helm rollback runbook

  docs/operator/runbooks/rollback.md covers helm rollback mechanics,
  the schema-migration manual-cleanup path (when *.down.sql files
  apply vs. when full restore is the only safe path), and the
  per-migration-class safe-to-rollback table.

DEPL-L2 (Low): Prometheus AlertManager rules

  deploy/helm/certctl/templates/prometheusrules.yaml: opt-in via
  monitoring.prometheusRules.enabled=true. Default OFF. Four starter
  rules using verified metric names from
  internal/api/handler/metrics.go:

    CertctlCertificateExpiringSoon (certctl_certificate_expiring_soon)
    CertctlAgentOffline ((agent_total - agent_online) > 0 for 1h)
    CertctlJobFailureRateHigh (failure rate over 5% for 15m)
    CertctlIssuanceFailures (any failures over 15m window)

  All thresholds operator-tunable via
  monitoring.prometheusRules.thresholds.* in values.

DEPL-L3 (Low): Prometheus bearer-token setup runbook

  docs/operator/runbooks/prometheus-bearer-token.md documents the
  API-key + Secret + values wiring for the RBAC-gated
  /api/v1/metrics/prometheus scrape endpoint. End-to-end procedure
  with troubleshooting steps + rotation guide.

CI guard: scripts/ci-guards/helm-templates-lint.sh

  Six-combo matrix: defaults / backup PVC / backup S3 /
  prometheusRules / migrations.viaHook / all-on. Each runs helm
  template + checks render success. helm lint also gated.

  Wired into the auto-pickup loop in .github/workflows/ci.yml;
  azure/setup-helm@b9e51907 (v4.3.0, SHA-pinned per Phase 1 RED-2)
  installs helm v3.16.0 on the runner.

Verification (all pass):

  ls deploy/helm/certctl/templates/{backup-cronjob,migration-job,prometheusrules}.yaml
  grep -E 'updateStrategy|podManagementPolicy' deploy/helm/certctl/templates/postgres-statefulset.yaml
    # 2 matches
  helm template deploy/helm/certctl/ --set backup.enabled=true \
    --set monitoring.prometheusRules.enabled=true --set migrations.viaHook=true \
    | grep -E "kind: (CronJob|PrometheusRule|Job)"
    # 3 matches
  helm lint deploy/helm/certctl/
    # 0 failed
  ls docs/operator/runbooks/{rollback,prometheus-bearer-token}.md
  bash scripts/ci-guards/helm-templates-lint.sh
    # 6/6 matrix combinations pass

Go build clean (cmd/server compiles, migrate-only path verified by
the build target). YAML validated.

Closes:
  cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H2
  cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M1
  cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M4
  cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M5
  cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L1
  cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L2
  cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L3
2026-07-26 13:58:13 +00:00 · 2026-05-14 00:58:00 +00:00
parent b2284ef2a4
commit d6f4d5c5e8
10 changed files with 1223 additions and 11 deletions
@@ -0,0 +1,243 @@
+# Runbook: Prometheus bearer token for the metrics scrape endpoint
+
+> Last reviewed: 2026-05-14
+
+Use this when:
+- You're enabling Prometheus Operator scraping via the Helm chart's
+  `monitoring.serviceMonitor.enabled` toggle.
+- Your Prometheus scrapes are returning 401 against
+  `/api/v1/metrics/prometheus`.
+- An auditor asks "how is the metrics endpoint authenticated?"
+
+## The constraint
+
+The certctl server exposes Prometheus metrics at
+`/api/v1/metrics/prometheus`. This endpoint is **RBAC-gated on the
+`metrics.read` permission** (per `internal/api/router/router.go`).
+Like every other gated handler, it requires an authenticated actor
+holding that permission — there is no anonymous-scrape path.
+
+The rationale: the metrics payload includes operational counters
+(cert counts by status, agent counts, issuance failure rates) that
+a public-facing observer should not see. Most certctl deployments
+expose a reverse proxy / load balancer to the wider network; the
+auth gate on `/api/v1/metrics/prometheus` prevents an external
+observer from learning operational state via the metrics endpoint
+even when the proxy itself is reachable.
+
+## What you need to set up
+
+Three pieces:
+
+1. **An API key with `metrics.read` permission** (and only that
+   permission — least-privilege).
+2. **A Kubernetes Secret** holding that API key.
+3. **`monitoring.serviceMonitor.bearerTokenSecret`** in the chart's
+   values pointing at the Secret.
+
+## Step 1: Create the metrics-read role + API key
+
+The chart's seed migration ships a `metrics-read` role-template, but
+some operators want a dedicated identity per scrape source. Both
+approaches work; the dedicated-identity path is below.
+
+```bash
+# 1. Bootstrap or impersonate a session with auth.role.assign +
+#    auth.apikey.create permissions (admin actor is fine).
+
+# 2. Create a role with only metrics.read.
+curl -sS --cacert ./ca.crt -X POST \
+  -H "Authorization: Bearer ${ADMIN_API_KEY}" \
+  -H "Content-Type: application/json" \
+  https://certctl.your-org.example/api/v1/auth/roles \
+  -d '{"id":"r-prometheus-scrape","name":"Prometheus scrape","permissions":["metrics.read"]}'
+
+# 3. Create an actor that holds the role.
+curl -sS --cacert ./ca.crt -X POST \
+  -H "Authorization: Bearer ${ADMIN_API_KEY}" \
+  -H "Content-Type: application/json" \
+  https://certctl.your-org.example/api/v1/auth/actors \
+  -d '{"id":"actor-prometheus","name":"Prometheus scrape","roles":["r-prometheus-scrape"]}'
+
+# 4. Mint an API key for the actor. The response includes a
+#    `key_value` field that's only returned ONCE — capture it.
+curl -sS --cacert ./ca.crt -X POST \
+  -H "Authorization: Bearer ${ADMIN_API_KEY}" \
+  -H "Content-Type: application/json" \
+  https://certctl.your-org.example/api/v1/auth/apikeys \
+  -d '{"actor_id":"actor-prometheus","name":"prometheus-scrape-token"}' \
+  | tee /tmp/prom-key.json
+
+# Extract just the secret material:
+jq -r '.key_value' /tmp/prom-key.json
+```
+
+The mint endpoint returns the API key plaintext exactly once. The
+server stores only a hash of the key, which it compares in constant
+time; if you lose the key value, mint a new one.
+
+## Step 2: Create the Kubernetes Secret
+
+```bash
+NAMESPACE=certctl
+API_KEY=$(jq -r '.key_value' /tmp/prom-key.json)
+
+kubectl create secret generic certctl-prometheus-key \
+  -n "$NAMESPACE" \
+  --from-literal=api-key="$API_KEY"
+```
+
+Now scrub the temporary file:
+
+```bash
+shred -u /tmp/prom-key.json
+```
+
+## Step 3: Wire the Secret into the chart values
+
+In your `values.yaml` (or `--set` overrides):
+
+```yaml
+monitoring:
+  enabled: true
+  serviceMonitor:
+    enabled: true
+    interval: 30s
+    scrapeTimeout: 10s
+    bearerTokenSecret:
+      name: certctl-prometheus-key
+      key: api-key
+```
+
+Re-apply the chart:
+
+```bash
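+# --reuse-values ignores edits to the chart's values.yaml; pass overrides via -f (placeholder file name) or --set.
+helm upgrade certctl . -n "$NAMESPACE" --reuse-values -f my-values.yaml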
+```
+
+The rendered ServiceMonitor will now include the `bearerTokenSecret`
+block. Prometheus Operator reconciles it into the generated scrape
+configuration, and Prometheus then sends the token as an
+`Authorization: Bearer` header on each scrape.
+
+## Verification
+
+```bash
+# 1. Confirm the ServiceMonitor renders with the secret reference
+kubectl get servicemonitor -n "$NAMESPACE" certctl-server -o yaml \
+  | grep -A2 bearerTokenSecret
+
+# Expected:
+#       bearerTokenSecret:
+#         name: certctl-prometheus-key
+#         key: api-key
+
+# 2. Tail the certctl-server logs for about 60 seconds (two scrapes
+#    at the 30s interval configured above). Look for incoming
+#    GET /api/v1/metrics/prometheus requests that authenticate
+#    successfully, with no 401s.
+kubectl logs -n "$NAMESPACE" -l app.kubernetes.io/component=server \
+  --tail=100 -f | grep -E "GET /api/v1/metrics/prometheus|metrics-scrape"
+
+# 3. From the Prometheus UI's "Targets" page, the certctl-server
+#    target should be UP and last-scrape-error empty. If it's
+#    showing 401, the bearer token isn't reaching the request — see
+#    troubleshooting below.
+```
+
+## Troubleshooting
+
+### Prometheus target shows 401
+
+Three possible causes:
+
+1. **Wrong Secret name / key.** Run
+   `kubectl get secret -n "$NAMESPACE" certctl-prometheus-key -o yaml`
+   and confirm the `data.api-key` field exists with a base64-encoded
+   non-empty value. The Secret's data field name must match the
+   `bearerTokenSecret.key` value in `monitoring.serviceMonitor`.
+2. **API key doesn't have `metrics.read`.** Hit the gating endpoint
+   manually from inside the cluster with the same key:
+   ```bash
+   kubectl run --rm -it --image=curlimages/curl debug -- \
+     curl -sS -H "Authorization: Bearer <API_KEY>" \
+     https://certctl-server.certctl.svc.cluster.local:8443/api/v1/metrics/prometheus
+   ```
+   A 401 here means the key itself was not accepted (wrong,
+   truncated, or revoked value). A 403 means the key authenticated
+   but its actor holds no role that grants `metrics.read`.
+3. **TLS verification failure (not a 401, but masquerading as one in
+   Prometheus's logs).** The default ServiceMonitor template sets
+   `insecureSkipVerify: true` to support demos — production deploys
+   should set `tlsConfig.caFile` or `tlsConfig.ca.secret` per the
+   ServiceMonitor docs.
+
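+The HTTP status from a manual probe can be folded into a small triage
+helper. This is a minimal sketch, assuming the server follows the
+conventional HTTP meanings (401 for a credential that was not
+accepted, 403 for a missing permission); `classify_scrape_status` is
+a hypothetical name and not part of certctl.
+
+```shell
+# Hypothetical helper: map the HTTP status returned by
+# /api/v1/metrics/prometheus to the most likely cause.
+classify_scrape_status() {
+  case "$1" in
+    200) echo "ok" ;;
+    401) echo "unauthenticated: check the Secret name, key, and token value" ;;
+    403) echo "forbidden: the actor's role lacks metrics.read" ;;
+    000) echo "no response: check TLS settings and reachability" ;;
+    *)   echo "unexpected status $1: check the server logs" ;;
+  esac
+}
+```
+
+Feed it the code printed by `curl -sS -o /dev/null -w '%{http_code}'`
+against the endpoint; curl prints `000` when no HTTP response was
+received at all, which usually points at TLS or networking rather
+than at the token.
+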
+### Prometheus target shows TLS errors
+
+`monitoring.serviceMonitor.tlsConfig` overrides the default. Three
+patterns:
+
+```yaml
+# Choose ONE of the tlsConfig blocks below; they are alternatives.
+# Pattern 1: trust the system CA bundle (production behind a real CA)
+tlsConfig:
+  caFile: /etc/ssl/certs/ca-certificates.crt
+  serverName: certctl.your-org.example
+
+# Pattern 2: trust a CA from a Secret mounted by Prometheus Operator
+tlsConfig:
+  ca:
+    secret:
+      name: certctl-ca
+      key: ca.crt
+  serverName: certctl.your-org.example
+
+# Pattern 3: skip verification (DEMO ONLY — DO NOT USE IN PRODUCTION)
+tlsConfig:
+  insecureSkipVerify: true
+```
+
+The certctl server's self-signed bootstrap cert (default
+`server.tls.existingSecret` from the chart) presents a CN of
+`certctl-server`. If your `serverName` doesn't match, the scrape
+fails with `x509: certificate is valid for certctl-server, not ...`.
+
+## Rotation
+
+API keys are constant-time-compared, stored hashed, and never
+logged. Rotation:
+
+```bash
+# 1. Mint a new key (same actor + role)
+curl -sS --cacert ./ca.crt -X POST \
+  -H "Authorization: Bearer ${ADMIN_API_KEY}" \
+  -H "Content-Type: application/json" \
+  https://certctl.your-org.example/api/v1/auth/apikeys \
+  -d '{"actor_id":"actor-prometheus","name":"prometheus-scrape-token-v2"}' \
+  | tee /tmp/prom-key-new.json
+
+# 2. Update the Secret in place
+kubectl create secret generic certctl-prometheus-key \
+  -n certctl \
+  --from-literal=api-key="$(jq -r '.key_value' /tmp/prom-key-new.json)" \
+  --dry-run=client -o yaml | kubectl apply -f -
+
+# 3. Wait one scrape interval; verify the next scrape uses the new key.
+
+# 4. Revoke the old key
+curl -sS --cacert ./ca.crt -X DELETE \
+  -H "Authorization: Bearer ${ADMIN_API_KEY}" \
+  https://certctl.your-org.example/api/v1/auth/apikeys/<OLD_KEY_ID>
+
+# 5. Scrub the temp file
+shred -u /tmp/prom-key-new.json
+```
+
+Prometheus Operator picks up Secret changes automatically — no
+ServiceMonitor edit needed, no Prometheus restart.
+
+## Related reading
+
+- [`docs/operator/rbac.md`](../rbac.md) — the full RBAC primitive,
+  permission catalogue, and role-assignment workflow.
+- [`docs/operator/security.md`](../security.md) — the broader auth
+  posture including the API key / OIDC / break-glass paths.
+- [`docs/operator/auth-threat-model.md`](../auth-threat-model.md) —
+  why `/api/v1/metrics/prometheus` is gated, and what an
+  unauthenticated leak of metrics data would reveal.
@@ -0,0 +1,193 @@
+# Runbook: Helm rollback for certctl
+
+> Last reviewed: 2026-05-14
+
+Use this when:
+- A `helm upgrade` rolled out a bad release and the operator wants to
+  return to the previous working state.
+- A schema migration shipped a change the operator wants to back out.
+- An emergency change needs reverting and forward-fix isn't yet
+  available.
+
+This page covers `helm rollback` mechanics + the cases where
+rollback is NOT enough on its own (schema migrations are the main
+one).
+
+## What `helm rollback` does
+
+`helm rollback <release> [revision]` re-applies the manifests from a
+previous Helm revision. It re-creates / updates Kubernetes objects to
+match that revision's template output and is safe for:
+
+- **Deployment image bumps:** rolls the container image back to the
+  previous tag. Pods restart with the old image.
+- **ConfigMap / Secret content changes:** old values land in the
+  config; pods that consume them via `envFrom` or volume mounts get
+  the prior values on the next restart.
+- **Resource requests / limits / replica count:** the spec changes
+  back to the prior values. Kubernetes reschedules pods accordingly.
+- **Service / Ingress / NetworkPolicy changes:** networking flips
+  back to the previous shape immediately.
+
+## What `helm rollback` does NOT do
+
+The Kubernetes layer is reversible; the **database schema is not**.
+This is the single most common gap in a rollback plan.
+
+### Schema migrations are forward-only by design
+
+certctl's migrations under `migrations/` are numbered up-migrations
+(`NNNNNN_*.up.sql`) with paired down-migrations
+(`NNNNNN_*.down.sql`) shipped alongside. The `postgres.RunMigrations`
+path, run at server boot or by the Helm migration hook, applies only
+the `*.up.sql` files. The `*.down.sql` files exist for development
+reference + a hypothetical "surgical revert" path but are **not
+invoked by `helm rollback`**.
+
+The implication: if `v2.1.0 → v2.2.0` ships migrations 000100,
+000101, 000102 (adding columns, changing constraints, dropping
+indexes), then `helm rollback` to v2.1.0 takes you back to the v2.1.0
+container image — but the database still has migrations 000100-102
+applied. The v2.1.0 server code doesn't know about those columns; it
+either ignores them (best case) or fails to start (if the schema
+diverged in a way the older code can't tolerate).
+
+### When is rollback safe without a schema revert?
+
+Migrations are **additive-only** in 90%+ of cases. The categories:
+
+| Migration class | Safe to roll back without schema revert? | Why |
+|---|---|---|
+| Add column with default | Yes | Old code ignores the new column |
+| Add table | Yes | Old code doesn't reference the table |
+| Add index | Yes | Old code doesn't depend on the index existing |
+| Add CHECK / FOREIGN KEY constraint | Usually yes | Fails only when old code writes rows that violate the new constraint |
+| Rename column / table | NO | Old code's queries reference the original name |
+| Drop column / table | NO (data loss) | New code already stopped writing the column; old code expects it |
+| Type change (`VARCHAR(40)` → `TEXT`) | Usually yes | Old code's column read still works |
+| Backfill a column | Yes | Old code ignores the backfilled value |
+
+If your upgrade only added columns / tables / indexes, `helm
+rollback` is sufficient. If it renamed or dropped anything, you need
+a database-level revert.
+
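+The table can be turned into a coarse pre-rollback check. This is a
+sketch, not a certctl tool: `classify_migration` is a hypothetical
+helper, and the pattern is a plain text match that flags any rename
+or table/column drop. It ignores SQL comments and quoting, so treat
+"additive" as a hint to read the file, not as proof.
+
+```shell
+# Hypothetical helper: flag a migration file whose SQL contains a
+# rename or a table/column drop (the "NO" rows in the table above).
+classify_migration() {
+  if grep -Eiq 'DROP[[:space:]]+(TABLE|COLUMN)|RENAME[[:space:]]' "$1"; then
+    echo "destructive"
+  else
+    echo "additive"
+  fi
+}
+```
+
+Run it over each `*.up.sql` file shipped between the two releases; a
+single "destructive" result means `helm rollback` alone is not
+enough.
+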
+## Procedure: standard rollback (additive-only migrations)
+
+```bash
+# 1. Identify the target revision
+helm history certctl -n <namespace>
+
+# 2. Take a backup BEFORE rolling back (defense in depth — if
+#    rollback exposes a data corruption issue, restore is the only
+#    path back)
+#    See docs/operator/runbooks/postgres-backup.md for the canonical
+#    pg_dump invocation.
+
+# 3. Roll back to the chosen revision
+helm rollback certctl <revision> -n <namespace> --wait --timeout 5m
+
+# 4. Verify
+kubectl get pods -n <namespace> -l app.kubernetes.io/instance=certctl
+kubectl logs -n <namespace> -l app.kubernetes.io/component=server --tail=50
+```
+
+Watch for migration-version mismatch warnings in the server logs. If
+the older server code refuses to start because the schema is ahead
+of what it knows about, escalate to "rollback with schema revert."
+
+## Procedure: rollback with schema revert
+
+This is the rare case. Use it when:
+- A column / table was renamed or dropped in the release being rolled back.
+- The older code refuses to start with the newer schema.
+
+```bash
+# 1. Take a fresh backup right NOW (the current schema is what we're
+#    reverting from; if anything goes wrong we want a clean
+#    forward-recovery option)
+kubectl exec -n <namespace> statefulset/certctl-postgres -- \
+  pg_dump --format=custom --no-owner --no-acl --dbname=certctl \
+  > "certctl-pre-rollback-$(date -u +%Y%m%dT%H%M%SZ).dump"
+
+# 2. Stop the server Deployment to prevent it from writing to the
+#    database during the revert
+kubectl scale deploy/certctl-server -n <namespace> --replicas=0
+
+# 3. Apply the relevant *.down.sql files manually, one at a time, in
+#    reverse migration-number order. Example for reverting three
+#    migrations (bash; 10# forces base 10, because a zero-padded
+#    number such as 000100 would otherwise be read as octal):
+NEW=000102  # newest migration on the running schema
+OLD=000100  # oldest migration to revert (inclusive)
+for ((n = 10#$NEW; n >= 10#$OLD; n--)); do
+  MIG=$(printf '%06d' "$n")
+  kubectl exec -i -n <namespace> statefulset/certctl-postgres -- \
+    psql --username=certctl --dbname=certctl --set ON_ERROR_STOP=1 \
+    < migrations/${MIG}_*.down.sql || break  # stop at the first failing revert
+done
+
+# 4. Manually update the schema_migrations table to reflect the
+#    reverted state (the migration runner's bookkeeping)
+kubectl exec -n <namespace> statefulset/certctl-postgres -- \
+  psql --username=certctl --dbname=certctl -c \
+  "DELETE FROM schema_migrations WHERE version > $((10#$OLD - 1));"
+
+# 5. NOW run helm rollback. The server pod will start with a schema
+#    that matches its code.
+helm rollback certctl <revision> -n <namespace> --wait --timeout 5m
+```
+
+The `*.down.sql` files are tested but only against pristine schemas —
+they may not handle every data shape a production database
+accumulates. ALWAYS take a backup first; the down-migrations are
+a recovery tool, not a transactional contract.
+
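+Shell arithmetic reads a zero-padded number such as `000100` as octal
+(64), which matters whenever migration numbers are computed rather
+than typed out. A minimal portable sketch of producing the revert
+order safely; `strip_zeros` and `revert_order` are hypothetical
+helper names, and the six-digit padding is assumed from the
+`NNNNNN_*` file naming.
+
+```shell
+# Hypothetical helper: drop leading zeros so the value is parsed as
+# decimal by shell arithmetic and printf.
+strip_zeros() {
+  n=$1
+  while [ "${n#0}" != "$n" ]; do n=${n#0}; done
+  echo "${n:-0}"
+}
+
+# Hypothetical helper: print migration numbers from NEW down to OLD
+# (inclusive), one per line, in the order the *.down.sql files apply.
+revert_order() {
+  hi=$(strip_zeros "$1")
+  lo=$(strip_zeros "$2")
+  while [ "$hi" -ge "$lo" ]; do
+    printf '%06d\n' "$hi"
+    hi=$((hi - 1))
+  done
+}
+```
+
+For example, `revert_order 000102 000100` prints `000102`, `000101`,
+`000100`. Printing the list first and reviewing it is a cheap dry run
+before any `psql` command touches the database.
+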
+## Procedure: full restore (when revert isn't tractable)
+
+When a down-migration would lose data (drop columns / tables that
+hold rows the older code can't read but the newer code populated), a
+full restore is the only safe path. This is the procedure described
+in
+[`docs/operator/runbooks/disaster-recovery.md`](disaster-recovery.md#postgres-restore).
+The summary:
+
+1. Stop certctl.
+2. Take a backup of the CURRENT schema (defense in depth).
+3. Restore the LAST backup taken BEFORE the bad upgrade.
+4. Roll the Helm release back to the matching code version.
+5. Restart certctl.
+6. Re-run any audited writes that happened in the window between the
+   backup and the bad upgrade (read the audit log; the API surface
+   is recoverable).
+
+The DR runbook owns the canonical commands.
+
+## Common pitfalls
+
+- **Forgetting the backup before rollback.** A schema-revert path is
+  not safe without a fresh backup. If something goes wrong mid-revert
+  and your most recent backup is from last night, you've lost any
+  cert-issuance history between then and now.
+- **Rolling back the chart without rolling back the database state**
+  on a release that included a destructive migration (drop column,
+  drop table). Symptoms: old code starts, queries fail with
+  "column does not exist," server crashes in a loop. Recovery
+  requires schema revert OR full restore.
+- **Letting the agents drift.** `helm rollback` updates the agent
+  DaemonSet's image too — agents on different versions than the
+  server may produce incompatible CSR payloads. After rollback,
+  confirm agent images are at the matching version via
+  `kubectl get daemonset certctl-agent -n <namespace> -o jsonpath='{.spec.template.spec.containers[0].image}'`.
+- **GHCR images pinned by digest:** the rollback restores the prior
+  `image:` value from the Helm template. If your operator workflow
+  uses `image.digest` pinning, the digest comes back too, so make
+  sure that digest still exists in the registry you pull from. On
+  ghcr.io old tags are never deleted, but a private mirror may have
+  garbage-collected them.
+
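+The agent-drift check above can be scripted by comparing the tag of
+the server image with the tag of the agent image. A sketch under
+stated assumptions: `image_version` and `same_version` are
+hypothetical helpers, and the image names in the usage note are
+placeholders, not the chart's real repositories.
+
+```shell
+# Hypothetical helper: extract the tag (or digest) from an image
+# reference. The registry/path is dropped first so that a registry
+# port such as localhost:5000 is not mistaken for a tag.
+image_version() {
+  ref=$1
+  case "$ref" in
+    *@*) echo "${ref##*@}" ;;
+    *)
+      last=${ref##*/}
+      case "$last" in
+        *:*) echo "${last##*:}" ;;
+        *)   echo "latest" ;;
+      esac
+      ;;
+  esac
+}
+
+# Succeeds when both image references carry the same tag.
+same_version() {
+  [ "$(image_version "$1")" = "$(image_version "$2")" ]
+}
+```
+
+Pass it the two `image:` strings read with `kubectl get ... -o
+jsonpath`, for example `same_version registry.example/certctl-server:v2.1.0
+registry.example/certctl-agent:v2.1.0`. The comparison is only
+meaningful for tags: two digest-pinned images differ by digest even
+when they come from the same release.
+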
+## Related reading
+
+- [`docs/operator/runbooks/postgres-backup.md`](postgres-backup.md) —
+  the backup procedure that's the precondition for any
+  schema-revert path.
+- [`docs/operator/runbooks/disaster-recovery.md`](disaster-recovery.md) —
+  the full restore procedure when rollback isn't tractable.
+- [`docs/migration/api-keys-to-rbac.md`](../../migration/api-keys-to-rbac.md) —
+  example of a migration that the runtime supports rolling back via
+  feature flag (rare).