deploy(helm): close Phase 4 — chart surface + DR + ops runbooks

Phase 4 of the certctl architecture diligence remediation closure.
Seven findings, all in deploy/helm/certctl/.

DEPL-H2 (High) — ship deploy/helm/certctl/templates/backup-cronjob.yaml
  Operator opt-in via backup.enabled=true. Default OFF. CronJob runs
  pg_dump --format=custom --no-owner --no-acl --dbname=certctl
  matching the canonical shape in
  docs/operator/runbooks/postgres-backup.md (so manual and
  automated dumps use identical flags and restore
  interchangeably). Sink: PVC (default) OR S3
  via aws-cli. Documented as in-cluster-Postgres only — managed DB
  deployments rely on their provider's PITR.

DEPL-M1 (Med) — Helm pre-install/pre-upgrade migration hook
  deploy/helm/certctl/templates/migration-job.yaml — runs
  `certctl-server --migrate-only` before the server Deployment
  rolls. The --migrate-only flag (new in cmd/server/main.go) is a
  hermetic schema-mutation pass: load config, open DB pool, run
  RunMigrations + RunSeed, exit 0. No HTTP listener, no scheduler,
  no signing setup.

  Server's boot-time RunMigrations call is now gated on
  CERTCTL_MIGRATIONS_VIA_HOOK — when set true, the server skips
  the boot path (the hook owns the work). Default still runs at
  boot, so Compose / VM / bare-metal deploys are unchanged.

  migrations.viaHook: false in values.yaml (off by default).

DEPL-M4 (Med) — explicit Postgres StatefulSet strategy fields
  deploy/helm/certctl/templates/postgres-statefulset.yaml adds:
    spec.updateStrategy.type: OnDelete
    spec.podManagementPolicy: OrderedReady
  Operator-controlled Postgres upgrades (the OnDelete strategy
  means a chart template tweak no longer triggers an immediate
  Postgres restart). OrderedReady aligns with the standard
  Postgres-on-Kubernetes pattern for any future HA work.

DEPL-M5 (Med) — per-fleet-size resource ladder documentation
  deploy/helm/certctl/values.yaml — extended comments next to
  server.resources + agent.resources documenting:
    "≤ 500 certs / 100 agents" → defaults are validated
    "5K certs / 1K agents" → starter suggestions, TBD Phase 8
    "50K certs / 10K agents" → starter suggestions, TBD Phase 8
  Numbers for the small-fleet case derive from the measured
  baselines in docs/operator/performance-baselines.md
  (50ms p50, < 3s for 1000-cert inventory walk, etc.). Larger
  fleet numbers explicitly marked TBD pending Phase 8 load-test
  runs — operators tune empirically until then.

DEPL-L1 (Low) — Helm rollback runbook
  docs/operator/runbooks/rollback.md — covers helm rollback
  mechanics, the schema-migration manual-cleanup path (when
  *.down.sql files apply vs. when full restore is the only safe
  path), and the per-migration-class safe-to-rollback table.

DEPL-L2 (Low) — Prometheus AlertManager rules
  deploy/helm/certctl/templates/prometheusrules.yaml — opt-in via
  monitoring.prometheusRules.enabled=true. Default OFF. Four
  starter rules using verified metric names from
  internal/api/handler/metrics.go:
    CertctlCertificateExpiringSoon (certctl_certificate_expiring_soon)
    CertctlAgentOffline ((agent_total - agent_online) > 0 for 1h)
    CertctlJobFailureRateHigh (failure rate over 5% for 15m)
    CertctlIssuanceFailures (any failures over 15m window)
  All thresholds operator-tunable via
  monitoring.prometheusRules.thresholds.* in values.

DEPL-L3 (Low) — Prometheus bearer-token setup runbook
  docs/operator/runbooks/prometheus-bearer-token.md — documents
  the API-key + Secret + values wiring for the RBAC-gated
  /api/v1/metrics/prometheus scrape endpoint. End-to-end
  procedure with troubleshooting steps + rotation guide.

CI guard: scripts/ci-guards/helm-templates-lint.sh
  Six-combo matrix: defaults / backup PVC / backup S3 /
  prometheusRules / migrations.viaHook / all-on. Each runs helm
  template + checks render success. helm lint also gated.
  Wired into the auto-pickup loop in .github/workflows/ci.yml;
  azure/setup-helm@b9e51907 (v4.3.0, SHA-pinned per Phase 1
  RED-2) installs helm v3.16.0 on the runner.

Verification (all pass):
  ls deploy/helm/certctl/templates/{backup-cronjob,migration-job,prometheusrules}.yaml
  grep -E 'updateStrategy|podManagementPolicy' deploy/helm/certctl/templates/postgres-statefulset.yaml  # 2 matches
  helm template deploy/helm/certctl/ --set backup.enabled=true \
    --set monitoring.prometheusRules.enabled=true --set migrations.viaHook=true \
    | grep -E "kind: (CronJob|PrometheusRule|Job)"  # 3 matches
  helm lint deploy/helm/certctl/  # 0 failed
  ls docs/operator/runbooks/{rollback,prometheus-bearer-token}.md
  bash scripts/ci-guards/helm-templates-lint.sh  # 6/6 matrix combinations pass

Go build clean (cmd/server compiles, migrate-only path verified by
the build target). YAML validated.

Closes: cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H2
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M1
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M4
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M5
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L1
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L2
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L3
shankar0123
2026-05-14 00:58:00 +00:00
parent b2284ef2a4
commit d6f4d5c5e8
10 changed files with 1223 additions and 11 deletions
@@ -0,0 +1,243 @@
# Runbook: Prometheus bearer token for the metrics scrape endpoint
> Last reviewed: 2026-05-14
Use this when:
- You're enabling Prometheus Operator scraping via the Helm chart's
`monitoring.serviceMonitor.enabled` toggle.
- Your Prometheus scrapes are returning 401 against
`/api/v1/metrics/prometheus`.
- An auditor asks "how is the metrics endpoint authenticated?"
## The constraint
The certctl server exposes Prometheus metrics at
`/api/v1/metrics/prometheus`. This endpoint is **RBAC-gated on the
`metrics.read` permission** (per `internal/api/router/router.go`).
Like every other gated handler, it requires an authenticated actor
holding that permission — there is no anonymous-scrape path.
The rationale: the metrics payload includes operational counters
(cert counts by status, agent counts, issuance failure rates) that
a public-facing observer should not see. Most certctl deployments
expose a reverse proxy / load balancer to the wider network; the
auth gate on `/api/v1/metrics/prometheus` prevents an external
observer from learning operational state via the metrics endpoint
even when the proxy itself is reachable.
## What you need to set up
Three pieces:
1. **An API key with `metrics.read` permission** (and only that
permission — least-privilege).
2. **A Kubernetes Secret** holding that API key.
3. **`monitoring.serviceMonitor.bearerTokenSecret`** in the chart's
values pointing at the Secret.
## Step 1: Create the metrics-read role + API key
The chart's seed migration ships a `metrics-read` role-template, but
some operators want a dedicated identity per scrape source. Both
approaches work; the dedicated-identity path is below.
```bash
# 1. Bootstrap or impersonate a session with auth.role.assign +
# auth.apikey.create permissions (admin actor is fine).
# 2. Create a role with only metrics.read.
curl -sS --cacert ./ca.crt -X POST \
-H "Authorization: Bearer ${ADMIN_API_KEY}" \
-H "Content-Type: application/json" \
https://certctl.your-org.example/api/v1/auth/roles \
-d '{"id":"r-prometheus-scrape","name":"Prometheus scrape","permissions":["metrics.read"]}'
# 3. Create an actor that holds the role.
curl -sS --cacert ./ca.crt -X POST \
-H "Authorization: Bearer ${ADMIN_API_KEY}" \
-H "Content-Type: application/json" \
https://certctl.your-org.example/api/v1/auth/actors \
-d '{"id":"actor-prometheus","name":"Prometheus scrape","roles":["r-prometheus-scrape"]}'
# 4. Mint an API key for the actor. The response includes a
# `key_value` field that's only returned ONCE — capture it.
curl -sS --cacert ./ca.crt -X POST \
-H "Authorization: Bearer ${ADMIN_API_KEY}" \
-H "Content-Type: application/json" \
https://certctl.your-org.example/api/v1/auth/apikeys \
-d '{"actor_id":"actor-prometheus","name":"prometheus-scrape-token"}' \
| tee /tmp/prom-key.json
# Extract just the secret material:
jq -r '.key_value' /tmp/prom-key.json
```
The mint endpoint returns the API key plaintext exactly once. The
server stores only a hash of the key and compares it in constant
time; if you lose the key value, mint a new one.
## Step 2: Create the Kubernetes Secret
```bash
NAMESPACE=certctl
API_KEY=$(jq -r '.key_value' /tmp/prom-key.json)
kubectl create secret generic certctl-prometheus-key \
-n "$NAMESPACE" \
--from-literal=api-key="$API_KEY"
```
Now scrub the temporary file:
```bash
shred -u /tmp/prom-key.json
```
## Step 3: Wire the Secret into the chart values
In your `values.yaml` (or `--set` overrides):
```yaml
monitoring:
enabled: true
serviceMonitor:
enabled: true
interval: 30s
scrapeTimeout: 10s
bearerTokenSecret:
name: certctl-prometheus-key
key: api-key
```
Re-apply the chart:
```bash
# Run from the chart directory; -f layers the edited values over the reused ones.
helm upgrade certctl . -n "$NAMESPACE" --reuse-values -f values.yaml
```
The rendered ServiceMonitor will now include the `bearerTokenSecret`
block. Prometheus Operator's reconciler picks it up and injects the
bearer token into the scrape request.
## Verification
```bash
# 1. Confirm the ServiceMonitor renders with the secret reference
kubectl get servicemonitor -n "$NAMESPACE" certctl-server -o yaml \
| grep -A2 bearerTokenSecret
# Expected:
# bearerTokenSecret:
# name: certctl-prometheus-key
# key: api-key
# 2. Tail the certctl-server logs for the next ~60 seconds (two
# scrape intervals at the 30s configured above). Look for incoming
# GET /api/v1/metrics/prometheus requests that authenticate
# successfully, with no 401s.
kubectl logs -n "$NAMESPACE" -l app.kubernetes.io/component=server \
--tail=100 -f | grep -E "GET /api/v1/metrics/prometheus|metrics-scrape"
# 3. From the Prometheus UI's "Targets" page, the certctl-server
# target should be UP and last-scrape-error empty. If it's
# showing 401, the bearer token isn't reaching the request — see
# troubleshooting below.
```
## Troubleshooting
### Prometheus target shows 401
Three possible causes:
1. **Wrong Secret name / key.** Run
`kubectl get secret -n "$NAMESPACE" certctl-prometheus-key -o yaml`
and confirm the `data.api-key` field exists with a base64-encoded
non-empty value. The Secret's data field name must match the
`bearerTokenSecret.key` value in `monitoring.serviceMonitor`.
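2. **API key is rejected or lacks `metrics.read`.** Hit the gated
endpoint manually from inside the cluster with the same key and
print only the status code:
```bash
# -k skips TLS verification for this one-off probe only; -w prints the status.
kubectl run --rm -it --restart=Never --image=curlimages/curl debug -- \
curl -sS -k -o /dev/null -w '%{http_code}\n' \
-H "Authorization: Bearer <API_KEY>" \
https://certctl-server.certctl.svc.cluster.local:8443/api/v1/metrics/prometheus
```
Under conventional HTTP semantics, a 401 here means the key itself
was not accepted (mistyped, truncated, or revoked). A 403 means the
key authenticated but its actor does not hold `metrics.read`.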
3. **TLS verification failure (not a 401, but masquerading as one in
Prometheus's logs).** The default ServiceMonitor template sets
`insecureSkipVerify: true` to support demos — production deploys
should set `tlsConfig.caFile` or `tlsConfig.ca.secret` per the
ServiceMonitor docs.
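As a rough triage aid for the causes above, the status code printed by a manual `curl -w '%{http_code}'` probe can be mapped to the likely fault. This is a minimal sketch, not part of certctl: the function name `diagnose_scrape_status` is hypothetical, and the 401/403 split assumes the server follows conventional HTTP semantics (401 for a key that fails authentication, 403 for an authenticated actor without the permission).

```bash
#!/usr/bin/env bash
# diagnose_scrape_status CODE: map an HTTP status code from a manual
# probe of /api/v1/metrics/prometheus to the most likely cause.
# Hypothetical helper; the mapping assumes conventional 401/403 semantics.
diagnose_scrape_status() {
  case "$1" in
    200) echo "ok: key accepted and actor holds metrics.read" ;;
    401) echo "unauthenticated: check the Secret name, key, and value" ;;
    403) echo "forbidden: key is valid but actor lacks metrics.read" ;;
    # curl reports 000 when no HTTP response was received at all.
    000) echo "no response: suspect TLS or connectivity" ;;
    *)   echo "unexpected status $1: inspect the server logs" ;;
  esac
}
```

Feeding it the output of the in-cluster probe (for example `diagnose_scrape_status "$code"`) keeps the three causes apart before anyone edits the Secret or the role.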
### Prometheus target shows TLS errors
`monitoring.serviceMonitor.tlsConfig` overrides the default. Three
patterns:
```yaml
# Pattern 1: trust the system CA bundle (production behind a real CA)
tlsConfig:
caFile: /etc/ssl/certs/ca-certificates.crt
serverName: certctl.your-org.example
# Pattern 2: trust a CA from a Secret mounted by Prometheus Operator
tlsConfig:
ca:
secret:
name: certctl-ca
key: ca.crt
serverName: certctl.your-org.example
# Pattern 3: skip verification (DEMO ONLY — DO NOT USE IN PRODUCTION)
tlsConfig:
insecureSkipVerify: true
```
The certctl server's self-signed bootstrap cert (default
`server.tls.existingSecret` from the chart) presents a CN of
`certctl-server`. If your `serverName` doesn't match, the scrape
fails with `x509: certificate is valid for certctl-server, not ...`.
## Rotation
API keys are constant-time-compared, stored hashed, and never
logged. Rotation:
```bash
# 1. Mint a new key (same actor + role)
curl -sS --cacert ./ca.crt -X POST \
-H "Authorization: Bearer ${ADMIN_API_KEY}" \
-H "Content-Type: application/json" \
https://certctl.your-org.example/api/v1/auth/apikeys \
-d '{"actor_id":"actor-prometheus","name":"prometheus-scrape-token-v2"}' \
| tee /tmp/prom-key-new.json
# 2. Update the Secret in place
kubectl create secret generic certctl-prometheus-key \
-n certctl \
--from-literal=api-key="$(jq -r '.key_value' /tmp/prom-key-new.json)" \
--dry-run=client -o yaml | kubectl apply -f -
# 3. Wait one scrape interval; verify the next scrape uses the new key.
# 4. Revoke the old key
curl -sS --cacert ./ca.crt -X DELETE \
-H "Authorization: Bearer ${ADMIN_API_KEY}" \
https://certctl.your-org.example/api/v1/auth/apikeys/<OLD_KEY_ID>
# 5. Scrub the temp file
shred -u /tmp/prom-key-new.json
```
Prometheus Operator picks up Secret changes automatically — no
ServiceMonitor edit needed, no Prometheus restart.
## Related reading
- [`docs/operator/rbac.md`](../rbac.md) — the full RBAC primitive,
permission catalogue, and role-assignment workflow.
- [`docs/operator/security.md`](../security.md) — the broader auth
posture including the API key / OIDC / break-glass paths.
- [`docs/operator/auth-threat-model.md`](../auth-threat-model.md) —
why `/api/v1/metrics/prometheus` is gated, and what an
unauthenticated leak of metrics data would reveal.
@@ -0,0 +1,193 @@
# Runbook: Helm rollback for certctl
> Last reviewed: 2026-05-14
Use this when:
- A `helm upgrade` rolled out a bad release and the operator wants to
return to the previous working state.
- A schema migration shipped a change the operator wants to back out.
- An emergency change needs reverting and forward-fix isn't yet
available.
This page covers `helm rollback` mechanics + the cases where
rollback is NOT enough on its own (schema migrations are the main
one).
## What `helm rollback` does
`helm rollback <release> [revision]` re-applies the manifests from a
previous Helm revision. It re-creates / updates Kubernetes objects to
match that revision's template output and is safe for:
- **Deployment image bumps:** rolls the container image back to the
previous tag. Pods restart with the old image.
- **ConfigMap / Secret content changes:** old values land in the
config; pods that consume them via `envFrom` or volume mounts get
the prior values on the next restart.
- **Resource requests / limits / replica count:** the spec changes
back to the prior values. Kubernetes reschedules pods accordingly.
- **Service / Ingress / NetworkPolicy changes:** networking flips
back to the previous shape immediately.
## What `helm rollback` does NOT do
The Kubernetes layer is reversible; the **database schema is not**.
This is the single most common gap in a rollback plan.
### Schema migrations are forward-only by design
certctl's migrations under `migrations/` are numbered up-migrations
(`NNNNNN_*.up.sql`) with paired down-migrations
(`NNNNNN_*.down.sql`) shipped alongside. The `postgres.RunMigrations`
path applied at server boot only runs the `*.up.sql` files. The
`*.down.sql` files exist for development reference + a hypothetical
"surgical revert" path but are **not invoked by `helm rollback`**.
The implication: if `v2.1.0 → v2.2.0` ships migrations 000100,
000101, 000102 (adding columns, changing constraints, dropping
indexes), then `helm rollback` to v2.1.0 takes you back to the v2.1.0
container image — but the database still has migrations 000100-102
applied. The v2.1.0 server code doesn't know about those columns; it
either ignores them (best case) or fails to start (if the schema
diverged in a way the older code can't tolerate).
### When is rollback safe without a schema revert?
Migrations are **additive-only** in 90%+ of cases. The categories:
| Migration class | Safe to roll back without schema revert? | Why |
|---|---|---|
| Add column with default | Yes | Old code ignores the new column |
| Add table | Yes | Old code doesn't reference the table |
| Add index | Yes | Old code doesn't depend on the index existing |
| Add CHECK / FOREIGN KEY constraint | Usually yes | Fails only when old code writes rows that violate the new constraint, which stays in the schema after rollback |
| Rename column / table | NO | Old code's queries reference the original name |
| Drop column / table | NO (data loss) | New code already stopped writing the column; old code expects it |
| Type change (`VARCHAR(40)` → `TEXT`) | Usually yes | Old code's column read still works |
| Backfill a column | Yes | Old code ignores the backfilled value |
If your upgrade only added columns / tables / indexes, `helm
rollback` is sufficient. If it renamed or dropped anything, you need
a database-level revert.
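The table above can be approximated with a quick text scan of the up-migrations in the release. This is a heuristic sketch only: the helper name `needs_schema_revert` is hypothetical, it matches line by line, and it will miss statements that omit the optional `COLUMN` keyword after `DROP` or split keywords across lines. It may also flag keywords that appear inside SQL comments. Treat a match as a prompt to read the migration, not as a verdict.

```bash
#!/usr/bin/env bash
# needs_schema_revert [FILE...]: exit 0 if the SQL (files, or stdin when
# no file is given) contains a rename or a table/column drop, the two
# classes the table marks as unsafe to roll back without a schema revert.
# Hypothetical helper; a case-insensitive, line-based heuristic.
needs_schema_revert() {
  grep -Eiq '(^|[[:space:]])(drop[[:space:]]+(table|column)|rename[[:space:]])' "$@"
}
```

Looping it over the `*.up.sql` files added between the two releases, and printing each file that matches, gives a short list to review before choosing between the standard rollback and the schema-revert procedure.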
## Procedure: standard rollback (additive-only migrations)
```bash
# 1. Identify the target revision
helm history certctl -n <namespace>
# 2. Take a backup BEFORE rolling back (defense in depth — if
# rollback exposes a data corruption issue, restore is the only
# path back)
# See docs/operator/runbooks/postgres-backup.md for the canonical
# pg_dump invocation.
# 3. Roll back to the chosen revision
helm rollback certctl <revision> -n <namespace> --wait --timeout 5m
# 4. Verify
kubectl get pods -n <namespace> -l app.kubernetes.io/instance=certctl
kubectl logs -n <namespace> -l app.kubernetes.io/component=server --tail=50
```
Watch for migration-version mismatch warnings in the server logs. If
the older server code refuses to start because the schema is ahead
of what it knows about, escalate to "rollback with schema revert."
## Procedure: rollback with schema revert
This is the rare case. Use it when:
- A column / table was renamed or dropped in the rolled-up release.
- The older code refuses to start with the newer schema.
```bash
# 1. Take a fresh backup right NOW (the current schema is what we're
# reverting from; if anything goes wrong we want a clean
# forward-recovery option)
kubectl exec -n <namespace> statefulset/certctl-postgres -- \
pg_dump --format=custom --no-owner --no-acl --dbname=certctl \
> "certctl-pre-rollback-$(date -u +%Y%m%dT%H%M%SZ).dump"
# 2. Stop the server Deployment to prevent it from writing to the
# database during the revert
kubectl scale deploy/certctl-server -n <namespace> --replicas=0
# 3. Apply the relevant *.down.sql files manually, one at a time, in
# reverse migration-number order. Example for reverting three
# migrations (000102, 000101, 000100):
NEW=000102 # newest migration on the running schema
OLD=000100 # oldest migration to revert (inclusive)
for MIG in 000102 000101 000100; do
kubectl exec -i -n <namespace> statefulset/certctl-postgres -- \
psql --username=certctl --dbname=certctl --set ON_ERROR_STOP=1 \
< migrations/${MIG}_*.down.sql
done
# 4. Manually update the schema_migrations table to reflect the
# reverted state (the migration runner's bookkeeping)
kubectl exec -n <namespace> statefulset/certctl-postgres -- \
psql --user=certctl --dbname=certctl -c \
"DELETE FROM schema_migrations WHERE version > $((10#$OLD - 1));"
# 5. NOW run helm rollback. The server pod will start with a schema
# that matches its code.
helm rollback certctl <revision> -n <namespace> --wait --timeout 5m
```
The `*.down.sql` files are tested but only against pristine schemas —
they may not handle every data shape a production database
accumulates. ALWAYS take a backup first; the down-migrations are
a recovery tool, not a transactional contract.
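The hard-coded migration list in step 3 can be derived from the two endpoints instead. A minimal sketch, assuming migration numbers are contiguous six-digit integers (as in the `NNNNNN_*` naming above); the helper name `revert_order` is hypothetical. The `10#` prefix forces base-10 arithmetic, because bash otherwise reads a zero-padded value such as `000100` as octal.

```bash
#!/usr/bin/env bash
# revert_order OLD NEW: print the migration numbers from NEW down to OLD
# (inclusive), zero-padded to six digits and separated by spaces.
# Hypothetical helper; assumes the numbers between OLD and NEW are contiguous.
revert_order() {
  local old=$((10#$1)) new=$((10#$2)) n
  local out=()
  for ((n = new; n >= old; n--)); do
    out+=("$(printf '%06d' "$n")")
  done
  echo "${out[*]}"
}

revert_order 000100 000102   # prints: 000102 000101 000100
```

With this in place the loop header becomes `for MIG in $(revert_order "$OLD" "$NEW"); do`, so the order cannot drift from the `NEW` and `OLD` values declared above it.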
## Procedure: full restore (when revert isn't tractable)
When a down-migration would lose data (drop columns / tables that
hold rows the older code can't read but the newer code populated), a
full restore is the only safe path. This is the procedure described
in
[`docs/operator/runbooks/disaster-recovery.md`](disaster-recovery.md#postgres-restore).
The summary:
1. Stop certctl.
2. Take a backup of the CURRENT schema (defense in depth).
3. Restore the LAST backup taken BEFORE the bad upgrade.
4. Roll the Helm release back to the matching code version.
5. Restart certctl.
6. Re-run any audited writes that happened in the window between the
backup and the bad upgrade (read the audit log; the API surface
is recoverable).
The DR runbook owns the canonical commands.
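Step 3 depends on picking the right dump. The sketch below selects the newest backup whose embedded timestamp falls strictly before a cutoff. It assumes backup filenames carry a UTC timestamp in the `%Y%m%dT%H%M%SZ` form used by the pre-rollback dump in this runbook; the helper name `last_backup_before` and the example filenames are hypothetical.

```bash
#!/usr/bin/env bash
# last_backup_before CUTOFF FILE...: print the FILE with the newest
# embedded timestamp (YYYYmmddTHHMMSSZ) that sorts strictly before CUTOFF.
# Returns non-zero when no file qualifies. Hypothetical helper.
last_backup_before() {
  local cutoff=$1 best="" best_ts="" f ts
  shift
  for f in "$@"; do
    # Skip names that do not contain a timestamp in the expected form.
    [[ $f =~ ([0-9]{8}T[0-9]{6}Z) ]] || continue
    ts=${BASH_REMATCH[1]}
    # Fixed-width UTC timestamps compare correctly as plain strings.
    if [[ "$ts" < "$cutoff" ]] && { [[ -z "$best_ts" ]] || [[ "$ts" > "$best_ts" ]]; }; then
      best=$f
      best_ts=$ts
    fi
  done
  [ -n "$best" ] && printf '%s\n' "$best"
}
```

Pass the time of the bad upgrade as the cutoff, for example the `UPDATED` time of the failing revision shown by `helm history`, converted to UTC in the same format.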
## Common pitfalls
- **Forgetting the backup before rollback.** A schema-revert path is
not safe without a fresh backup. If something goes wrong mid-revert
and your most recent backup is from last night, you've lost any
cert-issuance history between then and now.
- **Rolling back the chart without rolling back the database state**
on a release that included a destructive migration (drop column,
drop table). Symptoms: old code starts, queries fail with
"column does not exist," server crashes in a loop. Recovery
requires schema revert OR full restore.
- **Letting the agents drift.** `helm rollback` updates the agent
DaemonSet's image too; agents on different versions than the
server may produce incompatible CSR payloads. After rollback,
confirm agent images are at the matching version via
`kubectl get daemonset certctl-agent -n <namespace> -o jsonpath='{.spec.template.spec.containers[0].image}'`.
- **GHCR images pinned by digest:** the rollback restores the prior
`image:` value from the Helm template. If your operator workflow
uses `image.digest` pinning, the digest comes back too, so make
sure that digest is still pullable. Images on ghcr.io persist (old
tags are never deleted), but a private mirror may have
garbage-collected the digest.
## Related reading
- [`docs/operator/runbooks/postgres-backup.md`](postgres-backup.md) —
the backup procedure that's the precondition for any
schema-revert path.
- [`docs/operator/runbooks/disaster-recovery.md`](disaster-recovery.md) —
the full restore procedure when rollback isn't tractable.
- [`docs/migration/api-keys-to-rbac.md`](../../migration/api-keys-to-rbac.md) —
example of a migration that the runtime supports rolling back via
feature flag (rare).