Phase 4 of the certctl architecture diligence remediation closure.
Seven findings, all in deploy/helm/certctl/.
DEPL-H2 (High) — ship deploy/helm/certctl/templates/backup-cronjob.yaml
Operator opt-in via backup.enabled=true. Default OFF. CronJob runs
pg_dump --format=custom --no-owner --no-acl --dbname=certctl
matching the canonical shape in
docs/operator/runbooks/postgres-backup.md (so manual and
automated dumps are byte-identical). Sink: PVC (default) OR S3
via aws-cli. Documented as in-cluster-Postgres only — managed DB
deployments rely on their provider's PITR.
DEPL-M1 (Med) — Helm pre-install/pre-upgrade migration hook
deploy/helm/certctl/templates/migration-job.yaml — runs
`certctl-server --migrate-only` before the server Deployment
rolls. The --migrate-only flag (new in cmd/server/main.go) is a
hermetic schema-mutation pass: load config, open DB pool, run
RunMigrations + RunSeed, exit 0. No HTTP listener, no scheduler,
no signing setup.
Server's boot-time RunMigrations call is now gated on
CERTCTL_MIGRATIONS_VIA_HOOK — when set true, the server skips
the boot path (the hook owns the work). Default still runs at
boot, so Compose / VM / bare-metal deploys are unchanged.
migrations.viaHook: false in values.yaml (off by default).
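A rough way to confirm the hook wiring once migrations.viaHook=true is to look for Helm's standard helm.sh/hook annotation in the rendered output. The sketch below is a hypothetical helper (the name has_migration_hook is an assumption, not part of the chart). It reads rendered manifests on stdin and succeeds only when a pre-install or pre-upgrade hook annotation is present.

```shell
# Hypothetical helper (not part of the chart): reads rendered manifests on
# stdin and succeeds when a helm.sh/hook annotation names pre-install or
# pre-upgrade. Sibling annotations such as helm.sh/hook-weight do not match.
has_migration_hook() {
  grep -Eq '"?helm\.sh/hook"?:[[:space:]]*"?[a-z,-]*pre-(install|upgrade)'
}
```

Possible usage: helm template deploy/helm/certctl/ --set migrations.viaHook=true | has_migration_hook && echo "migration hook annotated"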
DEPL-M4 (Med) — explicit Postgres StatefulSet strategy fields
deploy/helm/certctl/templates/postgres-statefulset.yaml adds:
spec.updateStrategy.type: OnDelete
spec.podManagementPolicy: OrderedReady
Operator-controlled Postgres upgrades (the OnDelete strategy
means a chart template tweak no longer triggers an immediate
Postgres restart). OrderedReady aligns with the standard
Postgres-on-Kubernetes pattern for any future HA work.
DEPL-M5 (Med) — per-fleet-size resource ladder documentation
deploy/helm/certctl/values.yaml — extended comments next to
server.resources + agent.resources documenting:
"≤ 500 certs / 100 agents" → defaults are validated
"5K certs / 1K agents" → starter suggestions, TBD Phase 8
"50K certs / 10K agents" → starter suggestions, TBD Phase 8
Numbers for the small-fleet case derive from the measured
baselines in docs/operator/performance-baselines.md
(50ms p50, < 3s for 1000-cert inventory walk, etc.). Larger
fleet numbers explicitly marked TBD pending Phase 8 load-test
runs — operators tune empirically until then.
DEPL-L1 (Low) — Helm rollback runbook
docs/operator/runbooks/rollback.md — covers helm rollback
mechanics, the schema-migration manual-cleanup path (when
*.down.sql files apply vs. when full restore is the only safe
path), and the per-migration-class safe-to-rollback table.
DEPL-L2 (Low) — Prometheus AlertManager rules
deploy/helm/certctl/templates/prometheusrules.yaml — opt-in via
monitoring.prometheusRules.enabled=true. Default OFF. Four
starter rules using verified metric names from
internal/api/handler/metrics.go:
CertctlCertificateExpiringSoon (certctl_certificate_expiring_soon)
CertctlAgentOffline ((agent_total - agent_online) > 0 for 1h)
CertctlJobFailureRateHigh (failure rate over 5% for 15m)
CertctlIssuanceFailures (any failures over 15m window)
All thresholds operator-tunable via
monitoring.prometheusRules.thresholds.* in values.
DEPL-L3 (Low) — Prometheus bearer-token setup runbook
docs/operator/runbooks/prometheus-bearer-token.md — documents
the API-key + Secret + values wiring for the RBAC-gated
/api/v1/metrics/prometheus scrape endpoint. End-to-end
procedure with troubleshooting steps + rotation guide.
CI guard: scripts/ci-guards/helm-templates-lint.sh
Six-combo matrix: defaults / backup PVC / backup S3 /
prometheusRules / migrations.viaHook / all-on. Each runs helm
template + checks render success. helm lint also gated.
Wired into the auto-pickup loop in .github/workflows/ci.yml;
azure/setup-helm@b9e51907 (v4.3.0, SHA-pinned per Phase 1
RED-2) installs helm v3.16.0 on the runner.
Verification (all pass):
ls deploy/helm/certctl/templates/{backup-cronjob,migration-job,prometheusrules}.yaml
grep -E 'updateStrategy|podManagementPolicy' deploy/helm/certctl/templates/postgres-statefulset.yaml # 2 matches
helm template deploy/helm/certctl/ --set backup.enabled=true \
--set monitoring.prometheusRules.enabled=true --set migrations.viaHook=true \
| grep -E "kind: (CronJob|PrometheusRule|Job)" # 3 matches
helm lint deploy/helm/certctl/ # 0 failed
ls docs/operator/runbooks/{rollback,prometheus-bearer-token}.md
bash scripts/ci-guards/helm-templates-lint.sh # 6/6 matrix combinations pass
Go build clean (cmd/server compiles, migrate-only path verified by
the build target). YAML validated.
Closes: cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H2
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M1
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M4
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M5
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L1
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L2
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L3
Runbook: Prometheus bearer token for the metrics scrape endpoint
Last reviewed: 2026-05-14
Use this when:
- You're enabling Prometheus Operator scraping via the Helm chart's
  monitoring.serviceMonitor.enabled toggle.
- Your Prometheus scrapes are returning 401 against
  /api/v1/metrics/prometheus.
- An auditor asks "how is the metrics endpoint authenticated?"
The constraint
The certctl server exposes Prometheus metrics at
/api/v1/metrics/prometheus. This endpoint is RBAC-gated on the
metrics.read permission (per internal/api/router/router.go).
Like every other gated handler, it requires an authenticated actor
holding that permission — there is no anonymous-scrape path.
The rationale: the metrics payload includes operational counters
(cert counts by status, agent counts, issuance failure rates) that
a public-facing observer should not see. Most certctl deployments
expose a reverse proxy / load balancer to the wider network; the
auth gate on /api/v1/metrics/prometheus prevents an external
observer from learning operational state via the metrics endpoint
even when the proxy itself is reachable.
What you need to set up
Three pieces:
- An API key with metrics.read permission (and only that permission, for least privilege).
- A Kubernetes Secret holding that API key.
- monitoring.serviceMonitor.bearerTokenSecret in the chart's values pointing at the Secret.
Step 1: Create the metrics-read role + API key
The chart's seed migration ships a metrics-read role-template, but
some operators want a dedicated identity per scrape source. Both
approaches work; the dedicated-identity path is below.
# 1. Bootstrap or impersonate a session with auth.role.assign +
# auth.apikey.create permissions (admin actor is fine).
# 2. Create a role with only metrics.read.
curl -sS --cacert ./ca.crt -X POST \
-H "Authorization: Bearer ${ADMIN_API_KEY}" \
-H "Content-Type: application/json" \
https://certctl.your-org.example/api/v1/auth/roles \
-d '{"id":"r-prometheus-scrape","name":"Prometheus scrape","permissions":["metrics.read"]}'
# 3. Create an actor that holds the role.
curl -sS --cacert ./ca.crt -X POST \
-H "Authorization: Bearer ${ADMIN_API_KEY}" \
-H "Content-Type: application/json" \
https://certctl.your-org.example/api/v1/auth/actors \
-d '{"id":"actor-prometheus","name":"Prometheus scrape","roles":["r-prometheus-scrape"]}'
# 4. Mint an API key for the actor. The response includes a
# `key_value` field that's only returned ONCE — capture it.
curl -sS --cacert ./ca.crt -X POST \
-H "Authorization: Bearer ${ADMIN_API_KEY}" \
-H "Content-Type: application/json" \
https://certctl.your-org.example/api/v1/auth/apikeys \
-d '{"actor_id":"actor-prometheus","name":"prometheus-scrape-token"}' \
| tee /tmp/prom-key.json
# Extract just the secret material:
jq -r '.key_value' /tmp/prom-key.json
The mint endpoint returns the API key plaintext exactly once. The server stores only a hash of the key and compares it in constant time; if you lose the key value, mint a new one.
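Because the plaintext is returned only once, it can help to check that something was actually captured before moving on. The sketch below is a hypothetical guard (the function name is an assumption). It rejects an empty value and the literal string "null", which is what jq -r prints when the requested field is absent from the response.

```shell
# Hypothetical guard: fails if the captured key is empty or the literal
# string "null" (jq -r prints "null" when the field is missing).
require_key_value() {
  local value="$1"
  if [ -z "$value" ] || [ "$value" = "null" ]; then
    echo "no key_value captured; mint a new API key" >&2
    return 1
  fi
  return 0
}
```

Possible usage: require_key_value "$(jq -r '.key_value' /tmp/prom-key.json)" || exit 1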
Step 2: Create the Kubernetes Secret
NAMESPACE=certctl
API_KEY=$(jq -r '.key_value' /tmp/prom-key.json)
kubectl create secret generic certctl-prometheus-key \
-n "$NAMESPACE" \
--from-literal=api-key="$API_KEY"
Now scrub the temporary file:
shred -u /tmp/prom-key.json
Step 3: Wire the Secret into the chart values
In your values.yaml (or --set overrides):
monitoring:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 30s
    scrapeTimeout: 10s
    bearerTokenSecret:
      name: certctl-prometheus-key
      key: api-key
Re-apply the chart:
helm upgrade certctl . -n "$NAMESPACE" --reuse-values
The rendered ServiceMonitor will now include the bearerTokenSecret
block. Prometheus Operator's reconciler picks it up and injects the
bearer token into the scrape request.
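One way to keep these settings separate from the rest of your values is a small overlay file passed to helm with -f. The sketch below is a hypothetical helper: the function name and the overlay file name are assumptions, while the keys mirror the Step 3 values above.

```shell
# Hypothetical helper: writes the Step 3 settings to an overlay file chosen
# by the caller, parameterised on the Secret name and data key.
write_metrics_overlay() {
  local out="$1" secret_name="$2" secret_key="$3"
  cat > "$out" <<EOF
monitoring:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 30s
    scrapeTimeout: 10s
    bearerTokenSecret:
      name: ${secret_name}
      key: ${secret_key}
EOF
}
```

Possible usage: write_metrics_overlay metrics-values.yaml certctl-prometheus-key api-key, then add -f metrics-values.yaml to the helm upgrade command. Values supplied with -f are merged on top of the values reused from the previous release.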
Verification
# 1. Confirm the ServiceMonitor renders with the secret reference
kubectl get servicemonitor -n "$NAMESPACE" certctl-server -o yaml \
| grep -A2 bearerTokenSecret
# Expected:
#   bearerTokenSecret:
#     name: certctl-prometheus-key
#     key: api-key
# 2. Tail the certctl-server logs for the next ~60 seconds (two
# scrape intervals at the 30s interval set above). Look for incoming
# GET /api/v1/metrics/prometheus requests that authenticate
# successfully, with no 401s.
kubectl logs -n "$NAMESPACE" -l app.kubernetes.io/component=server \
--tail=100 -f | grep -E "GET /api/v1/metrics/prometheus|metrics-scrape"
# 3. From the Prometheus UI's "Targets" page, the certctl-server
# target should be UP and last-scrape-error empty. If it's
# showing 401, the bearer token isn't reaching the request — see
# troubleshooting below.
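The first verification step can also be scripted. The sketch below is a hypothetical helper (the function name is an assumption). It reads a ServiceMonitor manifest on stdin and succeeds only if the two lines following bearerTokenSecret carry the expected Secret name and key, in either order.

```shell
# Hypothetical helper: reads ServiceMonitor YAML on stdin and checks that the
# bearerTokenSecret block references the expected Secret name and data key.
check_bearer_ref() {
  local want_name="$1" want_key="$2" block
  # Capture the bearerTokenSecret line plus the two lines after it.
  block="$(grep -A2 'bearerTokenSecret:')" || return 1
  printf '%s\n' "$block" | grep -Eq "name:[[:space:]]*${want_name}\$" || return 1
  printf '%s\n' "$block" | grep -Eq "key:[[:space:]]*${want_key}\$" || return 1
}
```

Possible usage: kubectl get servicemonitor -n "$NAMESPACE" certctl-server -o yaml | check_bearer_ref certctl-prometheus-key api-key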
Troubleshooting
Prometheus target shows 401
Three possible causes:
- Wrong Secret name / key. Run
  kubectl get secret -n "$NAMESPACE" certctl-prometheus-key -o yaml
  and confirm the data.api-key field exists with a base64-encoded
  non-empty value. The Secret's data field name must match the
  bearerTokenSecret.key value in monitoring.serviceMonitor.
- API key doesn't have metrics.read. Hit the gating endpoint manually
  from inside the cluster with the same key:

  kubectl run --rm -it --image=curlimages/curl debug -- \
    curl -sS -H "Authorization: Bearer <API_KEY>" \
    https://certctl-server.certctl.svc.cluster.local:8443/api/v1/metrics/prometheus

  A 401 here means the role doesn't include metrics.read. A 403 means
  the role exists but the API key isn't assigned to it.
- TLS verification failure (not a 401, but masquerading as one in
  Prometheus's logs). The default ServiceMonitor template sets
  insecureSkipVerify: true to support demos; production deploys
  should set tlsConfig.caFile or tlsConfig.ca.secret per the
  ServiceMonitor docs.
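For the first cause, the Secret check can be done without printing the key. The sketch below is a hypothetical helper (the function name is an assumption). It reads the base64-encoded data value on stdin and succeeds only if it decodes to a non-empty string.

```shell
# Hypothetical helper: reads a base64-encoded Secret value on stdin and
# succeeds only if it decodes cleanly to a non-empty string. The decoded
# key is never written to stdout.
secret_value_nonempty() {
  local decoded
  decoded="$(base64 --decode 2>/dev/null)" || return 1
  [ -n "$decoded" ]
}
```

Possible usage: kubectl get secret -n "$NAMESPACE" certctl-prometheus-key -o go-template='{{index .data "api-key"}}' | secret_value_nonempty || echo "api-key is missing or empty"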
Prometheus target shows TLS errors
monitoring.serviceMonitor.tlsConfig overrides the default. Three
patterns:
# Pattern 1: trust the system CA bundle (production behind a real CA)
tlsConfig:
  caFile: /etc/ssl/certs/ca-certificates.crt
  serverName: certctl.your-org.example
# Pattern 2: trust a CA from a Secret mounted by Prometheus Operator
tlsConfig:
  ca:
    secret:
      name: certctl-ca
      key: ca.crt
  serverName: certctl.your-org.example
# Pattern 3: skip verification (DEMO ONLY, DO NOT USE IN PRODUCTION)
tlsConfig:
  insecureSkipVerify: true
The certctl server's self-signed bootstrap cert (default
server.tls.existingSecret from the chart) presents a CN of
certctl-server. If your serverName doesn't match, the scrape
fails with x509: certificate is valid for certctl-server, not ....
Rotation
API keys are constant-time-compared, stored hashed, and never logged. Rotation:
# 1. Mint a new key (same actor + role)
curl -sS --cacert ./ca.crt -X POST \
-H "Authorization: Bearer ${ADMIN_API_KEY}" \
-H "Content-Type: application/json" \
https://certctl.your-org.example/api/v1/auth/apikeys \
-d '{"actor_id":"actor-prometheus","name":"prometheus-scrape-token-v2"}' \
| tee /tmp/prom-key-new.json
# 2. Update the Secret in place
kubectl create secret generic certctl-prometheus-key \
-n certctl \
--from-literal=api-key="$(jq -r '.key_value' /tmp/prom-key-new.json)" \
--dry-run=client -o yaml | kubectl apply -f -
# 3. Wait one scrape interval; verify the next scrape uses the new key.
# 4. Revoke the old key
curl -sS --cacert ./ca.crt -X DELETE \
-H "Authorization: Bearer ${ADMIN_API_KEY}" \
https://certctl.your-org.example/api/v1/auth/apikeys/<OLD_KEY_ID>
# 5. Scrub the temp file
shred -u /tmp/prom-key-new.json
Prometheus Operator picks up Secret changes automatically — no ServiceMonitor edit needed, no Prometheus restart.
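Before revoking the old key, it may be worth confirming that the Secret really holds the new one. The sketch below is a hypothetical helper (the function name is an assumption). It compares the decoded Secret value read from stdin against the expected key without printing either. It only confirms the Secret contents, not that Prometheus has already scraped with the new key.

```shell
# Hypothetical helper: reads a base64-encoded Secret value on stdin and
# succeeds only if it decodes to exactly the expected (non-empty) key.
secret_holds_key() {
  local expected="$1" decoded
  decoded="$(base64 --decode 2>/dev/null)" || return 1
  [ -n "$expected" ] && [ "$decoded" = "$expected" ]
}
```

Possible usage: kubectl get secret -n certctl certctl-prometheus-key -o go-template='{{index .data "api-key"}}' | secret_holds_key "$(jq -r '.key_value' /tmp/prom-key-new.json)"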
Related reading
- docs/operator/rbac.md: the full RBAC primitive, permission catalogue, and role-assignment workflow.
- docs/operator/security.md: the broader auth posture including the API key / OIDC / break-glass paths.
- docs/operator/auth-threat-model.md: why /api/v1/metrics/prometheus is gated, and what an unauthenticated leak of metrics data would reveal.