mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 15:51:30 +00:00

shankar0123 d6f4d5c5e8 deploy(helm): close Phase 4 — chart surface + DR + ops runbooks

Phase 4 of the certctl architecture diligence remediation closure.
Seven findings, all in deploy/helm/certctl/.

DEPL-H2 (High) — ship deploy/helm/certctl/templates/backup-cronjob.yaml
  Operator opt-in via backup.enabled=true. Default OFF. CronJob runs
  pg_dump --format=custom --no-owner --no-acl --dbname=certctl
  matching the canonical shape in
  docs/operator/runbooks/postgres-backup.md (so manual and
  automated dumps are byte-identical). Sink: PVC (default) OR S3
  via aws-cli. Documented as in-cluster-Postgres only — managed DB
  deployments rely on their provider's PITR.

DEPL-M1 (Med) — Helm pre-install/pre-upgrade migration hook
  deploy/helm/certctl/templates/migration-job.yaml — runs
  `certctl-server --migrate-only` before the server Deployment
  rolls. The --migrate-only flag (new in cmd/server/main.go) is a
  hermetic schema-mutation pass: load config, open DB pool, run
  RunMigrations + RunSeed, exit 0. No HTTP listener, no scheduler,
  no signing setup.

  Server's boot-time RunMigrations call is now gated on
  CERTCTL_MIGRATIONS_VIA_HOOK — when set true, the server skips
  the boot path (the hook owns the work). Default still runs at
  boot, so Compose / VM / bare-metal deploys are unchanged.

  migrations.viaHook: false in values.yaml (off by default).

DEPL-M4 (Med) — explicit Postgres StatefulSet strategy fields
  deploy/helm/certctl/templates/postgres-statefulset.yaml adds:
    spec.updateStrategy.type: OnDelete
    spec.podManagementPolicy: OrderedReady
  Operator-controlled Postgres upgrades (the OnDelete strategy
  means a chart template tweak no longer triggers an immediate
  Postgres restart). OrderedReady aligns with the standard
  Postgres-on-Kubernetes pattern for any future HA work.

DEPL-M5 (Med) — per-fleet-size resource ladder documentation
  deploy/helm/certctl/values.yaml — extended comments next to
  server.resources + agent.resources documenting:
    "≤ 500 certs / 100 agents" → defaults are validated
    "5K certs / 1K agents" → starter suggestions, TBD Phase 8
    "50K certs / 10K agents" → starter suggestions, TBD Phase 8
  Numbers for the small-fleet case derive from the measured
  baselines in docs/operator/performance-baselines.md
  (50ms p50, < 3s for 1000-cert inventory walk, etc.). Larger
  fleet numbers explicitly marked TBD pending Phase 8 load-test
  runs — operators tune empirically until then.

DEPL-L1 (Low) — Helm rollback runbook
  docs/operator/runbooks/rollback.md — covers helm rollback
  mechanics, the schema-migration manual-cleanup path (when
  *.down.sql files apply vs. when full restore is the only safe
  path), and the per-migration-class safe-to-rollback table.

DEPL-L2 (Low) — Prometheus AlertManager rules
  deploy/helm/certctl/templates/prometheusrules.yaml — opt-in via
  monitoring.prometheusRules.enabled=true. Default OFF. Four
  starter rules using verified metric names from
  internal/api/handler/metrics.go:
    CertctlCertificateExpiringSoon (certctl_certificate_expiring_soon)
    CertctlAgentOffline ((agent_total - agent_online) > 0 for 1h)
    CertctlJobFailureRateHigh (failure rate over 5% for 15m)
    CertctlIssuanceFailures (any failures over 15m window)
  All thresholds operator-tunable via
  monitoring.prometheusRules.thresholds.* in values.

DEPL-L3 (Low) — Prometheus bearer-token setup runbook
  docs/operator/runbooks/prometheus-bearer-token.md — documents
  the API-key + Secret + values wiring for the RBAC-gated
  /api/v1/metrics/prometheus scrape endpoint. End-to-end
  procedure with troubleshooting steps + rotation guide.

CI guard: scripts/ci-guards/helm-templates-lint.sh
  Six-combo matrix: defaults / backup PVC / backup S3 /
  prometheusRules / migrations.viaHook / all-on. Each runs helm
  template + checks render success. helm lint also gated.
  Wired into the auto-pickup loop in .github/workflows/ci.yml;
  azure/setup-helm@b9e51907 (v4.3.0, SHA-pinned per Phase 1
  RED-2) installs helm v3.16.0 on the runner.

Verification (all pass):
  ls deploy/helm/certctl/templates/{backup-cronjob,migration-job,prometheusrules}.yaml
  grep -E 'updateStrategy|podManagementPolicy' deploy/helm/certctl/templates/postgres-statefulset.yaml  # 2 matches
  helm template deploy/helm/certctl/ --set backup.enabled=true \
    --set monitoring.prometheusRules.enabled=true --set migrations.viaHook=true \
    | grep -E "kind: (CronJob|PrometheusRule|Job)"  # 3 matches
  helm lint deploy/helm/certctl/  # 0 failed
  ls docs/operator/runbooks/{rollback,prometheus-bearer-token}.md
  bash scripts/ci-guards/helm-templates-lint.sh  # 6/6 matrix combinations pass

Go build clean (cmd/server compiles, migrate-only path verified by
the build target). YAML validated.

Closes: cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H2
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M1
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M4
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M5
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L1
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L2
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L3

2026-05-14 00:58:00 +00:00

8.1 KiB


Runbook: Prometheus bearer token for the metrics scrape endpoint

Last reviewed: 2026-05-14

Use this when:

You're enabling Prometheus Operator scraping via the Helm chart's monitoring.serviceMonitor.enabled toggle.
Your Prometheus scrapes are returning 401 against /api/v1/metrics/prometheus.
An auditor asks "how is the metrics endpoint authenticated?"

The constraint

The certctl server exposes Prometheus metrics at /api/v1/metrics/prometheus. This endpoint is RBAC-gated on the metrics.read permission (per internal/api/router/router.go). Like every other gated handler, it requires an authenticated actor holding that permission — there is no anonymous-scrape path.

The rationale: the metrics payload includes operational counters (cert counts by status, agent counts, issuance failure rates) that an outside observer should not see. Most certctl deployments are reachable from the wider network through a reverse proxy or load balancer; the auth gate on /api/v1/metrics/prometheus keeps an external observer from learning operational state through the metrics endpoint even when the proxy itself is reachable.

What you need to set up

Three pieces:

An API key with metrics.read permission (and only that permission — least-privilege).
A Kubernetes Secret holding that API key.
monitoring.serviceMonitor.bearerTokenSecret in the chart's values pointing at the Secret.

Step 1: Create the metrics-read role + API key

The chart's seed migration ships a metrics-read role-template, but some operators want a dedicated identity per scrape source. Both approaches work; the dedicated-identity path is below.

# 1. Bootstrap or impersonate a session with auth.role.assign +
#    auth.apikey.create permissions (admin actor is fine).

# 2. Create a role with only metrics.read.
curl -sS --cacert ./ca.crt -X POST \
  -H "Authorization: Bearer ${ADMIN_API_KEY}" \
  -H "Content-Type: application/json" \
  https://certctl.your-org.example/api/v1/auth/roles \
  -d '{"id":"r-prometheus-scrape","name":"Prometheus scrape","permissions":["metrics.read"]}'

# 3. Create an actor that holds the role.
curl -sS --cacert ./ca.crt -X POST \
  -H "Authorization: Bearer ${ADMIN_API_KEY}" \
  -H "Content-Type: application/json" \
  https://certctl.your-org.example/api/v1/auth/actors \
  -d '{"id":"actor-prometheus","name":"Prometheus scrape","roles":["r-prometheus-scrape"]}'

# 4. Mint an API key for the actor. The response includes a
#    `key_value` field that's only returned ONCE — capture it.
curl -sS --cacert ./ca.crt -X POST \
  -H "Authorization: Bearer ${ADMIN_API_KEY}" \
  -H "Content-Type: application/json" \
  https://certctl.your-org.example/api/v1/auth/apikeys \
  -d '{"actor_id":"actor-prometheus","name":"prometheus-scrape-token"}' \
  | tee /tmp/prom-key.json

# Extract just the secret material:
jq -r '.key_value' /tmp/prom-key.json

The mint endpoint returns the API key plaintext exactly once. The server stores only a hash of the key, which it compares in constant time; if you lose the key value, mint a new one.

Step 2: Create the Kubernetes Secret

NAMESPACE=certctl
API_KEY=$(jq -r '.key_value' /tmp/prom-key.json)

kubectl create secret generic certctl-prometheus-key \
  -n "$NAMESPACE" \
  --from-literal=api-key="$API_KEY"

Now scrub the temporary file:

shred -u /tmp/prom-key.json
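
An empty or truncated value in the Secret would make every scrape fail authentication, so it can be worth checking the stored value's length without printing the key. A minimal sketch: secret_value_length is a hypothetical helper name, and the kubectl line in the comment assumes the Secret name and key used above.

```shell
# Hypothetical helper: read a base64-encoded Secret value on stdin and
# print the decoded length in bytes, without echoing the value itself.
secret_value_length() {
  base64 -d | wc -c | tr -d ' '
}

# Usage against the cluster (not run here):
#   kubectl get secret certctl-prometheus-key -n "$NAMESPACE" \
#     -o jsonpath='{.data.api-key}' | secret_value_length
```

The printed length should equal the length of the key_value string returned by the mint call in Step 1.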

Step 3: Wire the Secret into the chart values

In your values.yaml (or --set overrides):

monitoring:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 30s
    scrapeTimeout: 10s
    bearerTokenSecret:
      name: certctl-prometheus-key
      key: api-key

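Re-apply the chart, passing the values file you edited explicitly. On its own, --reuse-values replays the previous release's values and would not pick up the new monitoring block:

helm upgrade certctl . -n "$NAMESPACE" --reuse-values -f values.yaml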

The rendered ServiceMonitor will now include the bearerTokenSecret block. Prometheus Operator's reconciler picks it up and injects the bearer token into the scrape request.
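
The wiring can also be checked on rendered output before anything reaches the cluster. A minimal sketch, assuming the rendered block has the shape shown under Verification (a bearerTokenSecret key followed by name and key lines); references_bearer_secret is a hypothetical helper name, and values.yaml stands in for whatever overrides file you use.

```shell
# Hypothetical helper: read rendered manifests on stdin and succeed only
# if a bearerTokenSecret block references the given Secret name.
references_bearer_secret() {
  local name="$1"
  # Take the two lines after each bearerTokenSecret key, then look for
  # a matching "name:" entry among them.
  grep -A2 'bearerTokenSecret:' \
    | grep -Eq "name:[[:space:]]*\"?${name}\"?[[:space:]]*$"
}

# Usage (not run here):
#   helm template certctl . -f values.yaml \
#     | references_bearer_secret certctl-prometheus-key && echo "wired"
```

Checking the render first keeps a typo in the Secret name from showing up only later as a failing scrape target.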

Verification

# 1. Confirm the ServiceMonitor renders with the secret reference
kubectl get servicemonitor -n "$NAMESPACE" certctl-server -o yaml \
  | grep -A2 bearerTokenSecret

# Expected:
#       bearerTokenSecret:
#         name: certctl-prometheus-key
#         key: api-key

# 2. Tail the certctl-server logs for the next ~60 seconds (two scrape
#    intervals at the 30s setting above). Look for incoming
#    GET /api/v1/metrics/prometheus requests that authenticate
#    successfully, with no 401s.
kubectl logs -n "$NAMESPACE" -l app.kubernetes.io/component=server \
  --tail=100 -f | grep -E "GET /api/v1/metrics/prometheus|metrics-scrape"

# 3. From the Prometheus UI's "Targets" page, the certctl-server
#    target should be UP and last-scrape-error empty. If it's
#    showing 401, the bearer token isn't reaching the request — see
#    troubleshooting below.

Troubleshooting

Prometheus target shows 401

Three possible causes:

Wrong Secret name / key. Run kubectl get secret -n "$NAMESPACE" certctl-prometheus-key -o yaml and confirm the data.api-key field exists with a base64-encoded non-empty value. The Secret's data field name must match the bearerTokenSecret.key value in monitoring.serviceMonitor.
API key doesn't have metrics.read. Hit the gating endpoint manually from inside the cluster with the same key:
```
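# -k skips TLS verification for this one-off check, so only the auth result is tested.
kubectl run --rm -it --restart=Never -n "$NAMESPACE" \
  --image=curlimages/curl debug -- \
  curl -sS -k -o /dev/null -w '%{http_code}\n' \
  -H "Authorization: Bearer <API_KEY>" \
  https://certctl-server.certctl.svc.cluster.local:8443/api/v1/metrics/prometheus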
```
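By the usual HTTP convention, a 401 here means the key itself was not accepted (mistyped, revoked, or sent in a malformed header), while a 403 means the key authenticated but its actor does not hold a role that includes metrics.read.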
TLS verification failure (not a 401, but masquerading as one in Prometheus's logs). The default ServiceMonitor template sets insecureSkipVerify: true to support demos — production deploys should set tlsConfig.caFile or tlsConfig.ca.secret per the ServiceMonitor docs.
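
The causes above can be narrowed down from the status of a single manual request. A minimal sketch: diagnose_scrape_status is a hypothetical helper name, the messages are illustrative, and the only behavior relied on is that curl's %{http_code} prints 000 when no HTTP response was received.

```shell
# Hypothetical helper: map the HTTP status of a manual scrape to the
# troubleshooting branch worth checking first.
diagnose_scrape_status() {
  case "$1" in
    200)     echo "ok: key accepted, metrics served" ;;
    401|403) echo "auth rejected: check the key value and its metrics.read role" ;;
    000)     echo "no HTTP response: check TLS settings and network reachability" ;;
    *)       echo "unexpected status $1: check server logs" ;;
  esac
}

# Usage (not run here; API_KEY is assumed to hold the scrape key):
#   diagnose_scrape_status "$(curl -sS -o /dev/null -w '%{http_code}' \
#     --cacert ./ca.crt -H "Authorization: Bearer ${API_KEY}" \
#     https://certctl.your-org.example/api/v1/metrics/prometheus)"
```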

Prometheus target shows TLS errors

monitoring.serviceMonitor.tlsConfig overrides the default. Three patterns:

# Pattern 1: trust the system CA bundle (production behind a real CA)
tlsConfig:
  caFile: /etc/ssl/certs/ca-certificates.crt
  serverName: certctl.your-org.example

# Pattern 2: trust a CA from a Secret mounted by Prometheus Operator
tlsConfig:
  ca:
    secret:
      name: certctl-ca
      key: ca.crt
  serverName: certctl.your-org.example

# Pattern 3: skip verification (DEMO ONLY — DO NOT USE IN PRODUCTION)
tlsConfig:
  insecureSkipVerify: true

The certctl server's self-signed bootstrap cert (default server.tls.existingSecret from the chart) presents a CN of certctl-server. If your serverName doesn't match, the scrape fails with x509: certificate is valid for certctl-server, not ....
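
When that error does appear, the message itself lists the names the certificate accepts, which is what serverName has to match. A minimal sketch that pulls those names out of a log line; names_from_x509_error is a hypothetical helper name, and the sample message in the usage comment is illustrative.

```shell
# Hypothetical helper: read log lines on stdin and print the name list
# from any "certificate is valid for <names>, not <host>" message.
names_from_x509_error() {
  sed -n 's/.*certificate is valid for \(.*\), not .*/\1/p'
}

# Usage (illustrative message, not captured from a real scrape):
#   echo 'x509: certificate is valid for certctl-server, not certctl.your-org.example' \
#     | names_from_x509_error
```

Any name it prints is a candidate value for tlsConfig.serverName.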

Rotation

API keys are stored hashed, compared in constant time, and never logged. To rotate one:

# 1. Mint a new key (same actor + role)
curl -sS --cacert ./ca.crt -X POST \
  -H "Authorization: Bearer ${ADMIN_API_KEY}" \
  -H "Content-Type: application/json" \
  https://certctl.your-org.example/api/v1/auth/apikeys \
  -d '{"actor_id":"actor-prometheus","name":"prometheus-scrape-token-v2"}' \
  | tee /tmp/prom-key-new.json

# 2. Update the Secret in place
kubectl create secret generic certctl-prometheus-key \
  -n certctl \
  --from-literal=api-key="$(jq -r '.key_value' /tmp/prom-key-new.json)" \
  --dry-run=client -o yaml | kubectl apply -f -

# 3. Wait one scrape interval; verify the next scrape uses the new key.

# 4. Revoke the old key
curl -sS --cacert ./ca.crt -X DELETE \
  -H "Authorization: Bearer ${ADMIN_API_KEY}" \
  https://certctl.your-org.example/api/v1/auth/apikeys/<OLD_KEY_ID>

# 5. Scrub the temp file
shred -u /tmp/prom-key-new.json

Prometheus Operator picks up Secret changes automatically — no ServiceMonitor edit needed, no Prometheus restart.
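
Before revoking the old key in step 4 (and before scrubbing the temp file), it can be reassuring to confirm that the Secret really holds the new key without printing either value. A minimal sketch that compares short fingerprints: key_fingerprint is a hypothetical helper name, and sha256sum is assumed to be available (GNU coreutils).

```shell
# Hypothetical helper: read a key on stdin, ignore newlines, and print
# the first 8 hex characters of its SHA-256 digest.
key_fingerprint() {
  tr -d '\n' | sha256sum | cut -c1-8
}

# Usage (not run here); the two fingerprints should be identical:
#   jq -r '.key_value' /tmp/prom-key-new.json | key_fingerprint
#   kubectl get secret certctl-prometheus-key -n certctl \
#     -o jsonpath='{.data.api-key}' | base64 -d | key_fingerprint
```

A mismatch means the Secret update in step 2 did not take effect, so the old key should stay active until it does.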

Related documentation

docs/operator/rbac.md: the full RBAC primitive, permission catalogue, and role-assignment workflow.
docs/operator/security.md: the broader auth posture including the API key / OIDC / break-glass paths.
docs/operator/auth-threat-model.md: why /api/v1/metrics/prometheus is gated, and what an unauthenticated leak of metrics data would reveal.
