deploy(helm): close Phase 4 — chart surface + DR + ops runbooks

Phase 4 of the certctl architecture diligence remediation closure.
Seven findings, all in deploy/helm/certctl/.

DEPL-H2 (High) — ship deploy/helm/certctl/templates/backup-cronjob.yaml
  Operator opt-in via backup.enabled=true. Default OFF. CronJob runs
  pg_dump --format=custom --no-owner --no-acl --dbname=certctl
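  matching the canonical shape in
  docs/operator/runbooks/postgres-backup.md (so manual and
  automated dumps use the same flags and restore the same way;
  the archives are not byte-identical because each custom-format
  dump records its own creation time). Sink: PVC (default) OR S3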
  via aws-cli. Documented as in-cluster-Postgres only — managed DB
  deployments rely on their provider's PITR.
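
  As a rough illustration of the artifact naming described above, the
  sketch below rebuilds the dump file name and S3 destination from a
  bucket, prefix, and UTC timestamp. The helper names (dump_name,
  s3_dest, backup_timestamp) and the bucket "example-backups" are
  hypothetical; only the naming pattern comes from the template.

```shell
#!/bin/sh
# Hypothetical helpers that mirror the naming used by the backup CronJob.

# dump_name TIMESTAMP -> certctl-<TIMESTAMP>.dump
dump_name() {
  printf 'certctl-%s.dump\n' "$1"
}

# s3_dest BUCKET PREFIX TIMESTAMP -> s3://<bucket>/<prefix>/certctl-<TIMESTAMP>.dump
s3_dest() {
  printf 's3://%s/%s/%s\n' "$1" "$2" "$(dump_name "$3")"
}

# UTC timestamp in the same format the CronJob script uses.
backup_timestamp() {
  date -u +%Y%m%dT%H%M%SZ
}

# Example: where a dump taken now would land in an assumed bucket.
s3_dest example-backups certctl "$(backup_timestamp)"
```

  A dump produced this way can be inspected with `pg_restore --list
  <dump>` and restored with `pg_restore -d certctl <dump>`, the same
  commands the template comments reference.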

DEPL-M1 (Med) — Helm pre-install/pre-upgrade migration hook
  deploy/helm/certctl/templates/migration-job.yaml — runs
  `certctl-server --migrate-only` before the server Deployment
  rolls. The --migrate-only flag (new in cmd/server/main.go) is a
  hermetic schema-mutation pass: load config, open DB pool, run
  RunMigrations + RunSeed, exit 0. No HTTP listener, no scheduler,
  no signing setup.

  Server's boot-time RunMigrations call is now gated on
  CERTCTL_MIGRATIONS_VIA_HOOK — when set true, the server skips
  the boot path (the hook owns the work). Default still runs at
  boot, so Compose / VM / bare-metal deploys are unchanged.

  migrations.viaHook: false in values.yaml (off by default).
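
  The boot-time gating described above can be sketched as a small
  decision function. This is a shell rendition for illustration only:
  the real check lives in cmd/server/main.go, the function name
  boot_migration_mode is hypothetical, and treating only the literal
  string "true" as enabled is an assumption about how the variable is
  parsed.

```shell
#!/bin/sh
# Sketch of the boot-time decision. Assumption: only the exact value
# "true" hands migration ownership to the Helm hook.

# Prints "skip" when the hook owns migrations, "run" otherwise.
boot_migration_mode() {
  if [ "${CERTCTL_MIGRATIONS_VIA_HOOK:-}" = "true" ]; then
    echo "skip"
  else
    echo "run"
  fi
}

# With the variable unset (Compose / VM / bare-metal), this prints "run".
boot_migration_mode
```

  On Kubernetes the hook path would be enabled with something like
  `helm upgrade --install <release> deploy/helm/certctl --set
  migrations.viaHook=true`, where <release> is a placeholder.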

DEPL-M4 (Med) — explicit Postgres StatefulSet strategy fields
  deploy/helm/certctl/templates/postgres-statefulset.yaml adds:
    spec.updateStrategy.type: OnDelete
    spec.podManagementPolicy: OrderedReady
  Makes Postgres upgrades operator-controlled: with OnDelete, a
  chart template tweak no longer triggers an immediate Postgres
  restart, and the new spec applies only when the operator deletes
  the pod. OrderedReady (already the Kubernetes default, now
  explicit) aligns with the standard Postgres-on-Kubernetes
  pattern for any future HA work.
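
  Under OnDelete, a changed StatefulSet spec takes effect only when
  the pod is deleted and recreated. The sketch below derives the pod
  name to delete. It assumes the StatefulSet is named
  "<fullname>-postgres" (inferred from the chart's serviceName, not
  confirmed by this diff); postgres_pod_name is a hypothetical helper.

```shell
#!/bin/sh
# StatefulSet pods are named <statefulset-name>-<ordinal>.
# Assumption: the StatefulSet name is "<fullname>-postgres".
postgres_pod_name() {
  fullname="$1"
  ordinal="${2:-0}"   # single-replica Postgres uses ordinal 0
  printf '%s-postgres-%s\n' "$fullname" "$ordinal"
}

# Print (do not run) the command that would apply a pending spec change.
printf 'kubectl delete pod %s\n' "$(postgres_pod_name certctl 0)"
```

  Taking a backup before deleting the pod matches the intent stated
  in the template comments.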

DEPL-M5 (Med) — per-fleet-size resource ladder documentation
  deploy/helm/certctl/values.yaml — extended comments next to
  server.resources + agent.resources documenting:
    "≤ 500 certs / 100 agents" → defaults are validated
    "5K certs / 1K agents" → starter suggestions, TBD Phase 8
    "50K certs / 10K agents" → starter suggestions, TBD Phase 8
  Numbers for the small-fleet case derive from the measured
  baselines in docs/operator/performance-baselines.md
  (~50ms mean per request, < 3s for 1000-cert inventory walk,
  etc.). Larger fleet numbers explicitly marked TBD pending
  Phase 8 load-test runs; operators tune empirically until then.

DEPL-L1 (Low) — Helm rollback runbook
  docs/operator/runbooks/rollback.md — covers helm rollback
  mechanics, the schema-migration manual-cleanup path (when
  *.down.sql files apply vs. when full restore is the only safe
  path), and the per-migration-class safe-to-rollback table.

DEPL-L2 (Low) — Prometheus AlertManager rules
  deploy/helm/certctl/templates/prometheusrules.yaml — opt-in via
  monitoring.prometheusRules.enabled=true. Default OFF. Four
  starter rules using verified metric names from
  internal/api/handler/metrics.go:
    CertctlCertificateExpiringSoon (certctl_certificate_expiring_soon)
    CertctlAgentOffline ((agent_total - agent_online) > 0 for 1h)
    CertctlJobFailureRateHigh (failure rate over 5% for 15m)
    CertctlIssuanceFailures (any failures over 15m window)
  All thresholds operator-tunable via
  monitoring.prometheusRules.thresholds.* in values.
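
  To make the CertctlJobFailureRateHigh threshold concrete, the sketch
  below works the ratio failed / (failed + completed) for counts
  observed in one window and compares it with the 5% default. The
  function names and the sample counts are illustrative, not taken
  from the chart.

```shell
#!/bin/sh
# failure_ratio FAILED COMPLETED -> failed / (failed + completed), 4 decimals.
# Prints "no-data" when no jobs ran in the window.
failure_ratio() {
  awk -v f="$1" -v c="$2" 'BEGIN {
    total = f + c
    if (total == 0) { print "no-data"; exit }
    printf "%.4f\n", f / total
  }'
}

# exceeds_threshold RATIO THRESHOLD -> exit status 0 when RATIO > THRESHOLD.
# Expects a numeric RATIO (do not pass "no-data").
exceeds_threshold() {
  awk -v r="$1" -v t="$2" 'BEGIN { exit !(r > t) }'
}

# Sample window: 3 failed jobs, 97 completed jobs.
ratio="$(failure_ratio 3 97)"
if exceeds_threshold "$ratio" 0.05; then
  echo "ratio $ratio: alert"
else
  echo "ratio $ratio: ok"
fi
```

  With these sample counts the ratio is 3%, below the 5% default, so
  no alert would be raised; 6 failures against 94 completions would
  cross it.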

DEPL-L3 (Low) — Prometheus bearer-token setup runbook
  docs/operator/runbooks/prometheus-bearer-token.md — documents
  the API-key + Secret + values wiring for the RBAC-gated
  /api/v1/metrics/prometheus scrape endpoint. End-to-end
  procedure with troubleshooting steps + rotation guide.
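
  A quick manual check of the scrape endpoint might look like the
  sketch below. It assumes the API key is presented as a standard
  "Authorization: Bearer" header; CERTCTL_URL and CERTCTL_API_KEY are
  placeholder variable names chosen for this example, not settings
  defined by certctl.

```shell
#!/bin/sh
# Placeholders: CERTCTL_URL and CERTCTL_API_KEY are illustrative names.

# auth_header TOKEN -> the HTTP header carrying the bearer token.
auth_header() {
  printf 'Authorization: Bearer %s\n' "$1"
}

# metrics_url BASE -> BASE (trailing slash stripped) + the metrics path.
metrics_url() {
  printf '%s/api/v1/metrics/prometheus\n' "${1%/}"
}

# Only attempt the request when both inputs are provided.
if [ -n "${CERTCTL_URL:-}" ] && [ -n "${CERTCTL_API_KEY:-}" ]; then
  curl --silent --show-error --fail \
    --header "$(auth_header "$CERTCTL_API_KEY")" \
    "$(metrics_url "$CERTCTL_URL")" | head -n 5
fi
```

  A non-zero exit from curl with --fail indicates an HTTP error
  status, which is a reasonable first thing to check when
  troubleshooting.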

CI guard: scripts/ci-guards/helm-templates-lint.sh
  Six-combo matrix: defaults / backup PVC / backup S3 /
  prometheusRules / migrations.viaHook / all-on. Each runs helm
  template + checks render success. helm lint also gated.
  Wired into the auto-pickup loop in .github/workflows/ci.yml;
  azure/setup-helm@b9e51907 (v4.3.0, SHA-pinned per Phase 1
  RED-2) installs helm v3.16.0 on the runner.
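
  The six-combination matrix could be organized roughly as below.
  This is a hypothetical re-creation, not the contents of
  scripts/ci-guards/helm-templates-lint.sh: the combination names,
  the exact --set flags, and the bucket "example-bucket" are
  assumptions built from the values keys this commit introduces.

```shell
#!/bin/sh
# Hypothetical sketch of a six-combination render matrix.
COMBOS="defaults backup-pvc backup-s3 prometheus-rules migrations-hook all-on"

# combo_flags NAME -> the helm --set flags assumed for that combination.
combo_flags() {
  case "$1" in
    defaults)         echo "" ;;
    backup-pvc)       echo "--set backup.enabled=true --set backup.sink=pvc" ;;
    backup-s3)        echo "--set backup.enabled=true --set backup.sink=s3 --set backup.s3.bucket=example-bucket" ;;
    prometheus-rules) echo "--set monitoring.prometheusRules.enabled=true" ;;
    migrations-hook)  echo "--set migrations.viaHook=true" ;;
    all-on)           echo "--set backup.enabled=true --set monitoring.prometheusRules.enabled=true --set migrations.viaHook=true" ;;
    *)                echo "unknown combo: $1" >&2; return 1 ;;
  esac
}

CHART="deploy/helm/certctl"
# Render only when helm and the chart directory are both available.
if command -v helm >/dev/null 2>&1 && [ -d "$CHART" ]; then
  for combo in $COMBOS; do
    # Word splitting of the flag string is intentional here.
    if helm template "$CHART" $(combo_flags "$combo") >/dev/null; then
      echo "PASS $combo"
    else
      echo "FAIL $combo"
      exit 1
    fi
  done
fi
```

  Keeping the flag sets in one function makes it easy to add a
  combination when a new toggle lands in values.yaml.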

Verification (all pass):
  ls deploy/helm/certctl/templates/{backup-cronjob,migration-job,prometheusrules}.yaml
  grep -E 'updateStrategy|podManagementPolicy' deploy/helm/certctl/templates/postgres-statefulset.yaml  # 2 matches
  helm template deploy/helm/certctl/ --set backup.enabled=true \
    --set monitoring.prometheusRules.enabled=true --set migrations.viaHook=true \
    | grep -E "kind: (CronJob|PrometheusRule|Job)"  # 3 matches
  helm lint deploy/helm/certctl/  # 0 failed
  ls docs/operator/runbooks/{rollback,prometheus-bearer-token}.md
  bash scripts/ci-guards/helm-templates-lint.sh  # 6/6 matrix combinations pass

Go build clean (cmd/server compiles, migrate-only path verified by
the build target). YAML validated.

Closes: cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H2
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M1
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M4
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M5
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L1
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L2
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L3
This commit is contained in:
shankar0123
2026-05-14 00:58:00 +00:00
parent b2284ef2a4
commit d6f4d5c5e8
10 changed files with 1223 additions and 11 deletions
@@ -0,0 +1,178 @@
{{- /*
Phase 4 DEPL-H2 closure (2026-05-14): opt-in Helm CronJob for
PostgreSQL backups.
OPERATOR OPT-IN. Default `backup.enabled: false`. Turning it on
requires:
- In-cluster Postgres (this CronJob does NOT cover managed DB
services — for AWS RDS / GCP CloudSQL / Azure DB rely on the
provider's PITR).
- A sink choice (PVC or S3) configured in values.yaml.
- For S3: a Secret holding AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY
(or use a service account with IRSA on EKS).
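The pg_dump invocation matches the canonical shape documented in
docs/operator/runbooks/postgres-backup.md so a manual run and a
CronJob run produce interchangeable dumps (same flags, same restore
procedure; not byte-identical, since each archive records its own
creation time):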
pg_dump --format=custom --no-owner --no-acl --dbname=certctl
For sink choices beyond PVC + S3 (GCS, Azure Blob, NFS, restic, etc.),
extend the `aws s3 cp` line below. The Job is intentionally minimal —
it does ONE thing (capture + ship), not orchestrate retention or
rotation. Off-host retention is the sink's responsibility (S3 lifecycle
rules, PVC snapshot retention on the storage class, etc.).
*/ -}}
{{- if .Values.backup.enabled }}
apiVersion: batch/v1
kind: CronJob
metadata:
name: {{ include "certctl.fullname" . }}-postgres-backup
labels:
{{- include "certctl.labels" . | nindent 4 }}
app.kubernetes.io/component: postgres-backup
spec:
schedule: {{ .Values.backup.schedule | quote }}
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: {{ .Values.backup.successfulJobsHistoryLimit | default 3 }}
failedJobsHistoryLimit: {{ .Values.backup.failedJobsHistoryLimit | default 1 }}
startingDeadlineSeconds: {{ .Values.backup.startingDeadlineSeconds | default 300 }}
jobTemplate:
spec:
backoffLimit: {{ .Values.backup.backoffLimit | default 1 }}
activeDeadlineSeconds: {{ .Values.backup.activeDeadlineSeconds | default 3600 }}
template:
metadata:
labels:
{{- include "certctl.labels" . | nindent 12 }}
app.kubernetes.io/component: postgres-backup
spec:
restartPolicy: Never
{{- with .Values.imagePullSecrets }}
imagePullSecrets:
{{- toYaml . | nindent 12 }}
{{- end }}
serviceAccountName: {{ include "certctl.serviceAccountName" . }}
securityContext:
runAsUser: 1000
runAsGroup: 1000
runAsNonRoot: true
fsGroup: 1000
containers:
- name: backup
image: {{ .Values.backup.image | default "postgres:16-alpine" | quote }}
imagePullPolicy: {{ .Values.backup.imagePullPolicy | default "IfNotPresent" | quote }}
env:
- name: PGHOST
value: {{ include "certctl.fullname" . }}-postgres
- name: PGPORT
value: {{ .Values.postgresql.service.port | default 5432 | quote }}
- name: PGUSER
valueFrom:
secretKeyRef:
name: {{ include "certctl.fullname" . }}-postgres
key: username
- name: PGPASSWORD
valueFrom:
secretKeyRef:
name: {{ include "certctl.fullname" . }}-postgres
key: password
- name: PGDATABASE
valueFrom:
secretKeyRef:
name: {{ include "certctl.fullname" . }}-postgres
key: database
{{- if eq (.Values.backup.sink | default "pvc") "s3" }}
# S3 sink — operator provides AWS credentials via the
# Secret referenced in backup.s3.credentialsSecret. The
# credentials need s3:PutObject + s3:ListBucket on the
# target bucket only; least-privilege per industry
# standard.
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: {{ .Values.backup.s3.credentialsSecret.name | quote }}
key: {{ .Values.backup.s3.credentialsSecret.accessKeyIdKey | default "AWS_ACCESS_KEY_ID" }}
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: {{ .Values.backup.s3.credentialsSecret.name | quote }}
key: {{ .Values.backup.s3.credentialsSecret.secretAccessKeyKey | default "AWS_SECRET_ACCESS_KEY" }}
{{- with .Values.backup.s3.region }}
- name: AWS_DEFAULT_REGION
value: {{ . | quote }}
{{- end }}
{{- end }}
command:
- /bin/sh
- -ceu
- |
# Phase 4 DEPL-H2: canonical pg_dump shape per
# docs/operator/runbooks/postgres-backup.md.
# Custom-format compressed dump, no ownership /
# ACL embedded — produces a portable artifact
# restorable into any Postgres ≥ source major
# via `pg_restore -d certctl <dump>`.
set -euo pipefail
TIMESTAMP="$(date -u +%Y%m%dT%H%M%SZ)"
DUMP_FILE="/tmp/certctl-${TIMESTAMP}.dump"
echo "[backup-cronjob] capturing dump at ${TIMESTAMP}"
pg_dump --format=custom --no-owner --no-acl --dbname="${PGDATABASE}" \
> "${DUMP_FILE}"
# Integrity check — pg_restore --list parses the
# dump's table-of-contents; a corrupt dump fails
# here without shipping garbage off-host. Same
# check the manual runbook performs.
echo "[backup-cronjob] verifying dump integrity"
pg_restore --list "${DUMP_FILE}" > /dev/null
{{- if eq (.Values.backup.sink | default "pvc") "s3" }}
# S3 sink — requires aws-cli. The default
# postgres:16-alpine image does NOT include
# aws-cli; operators MUST set
# backup.image to an image that bundles both
# (e.g. ghcr.io/your-org/postgres-aws:16). This
# template hardcodes the command and runs as
# non-root UID 1000, so there is no
# backup.command override for installing aws-cli
# at runtime. The line below assumes the image
# has `aws` on PATH.
S3_PATH="{{ .Values.backup.s3.bucket }}/{{ .Values.backup.s3.prefix | default "certctl" }}/certctl-${TIMESTAMP}.dump"
echo "[backup-cronjob] uploading to s3://${S3_PATH}"
aws s3 cp "${DUMP_FILE}" "s3://${S3_PATH}"
rm -f "${DUMP_FILE}"
{{- else }}
# PVC sink — dump lands at /backups/certctl-${TIMESTAMP}.dump
# mounted from backup.pvc.claimName. Retention is the
# PVC's responsibility (storage-class snapshot lifecycle
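# or a separate cleanup CronJob). /tmp and /backups
# are different filesystems, so mv copies rather
# than renames. The Job copies under a .partial
# name, then renames within /backups, so a partial
# dump never appears under its final name.
FINAL_PATH="/backups/certctl-${TIMESTAMP}.dump"
echo "[backup-cronjob] persisting to ${FINAL_PATH}"
mv "${DUMP_FILE}" "${FINAL_PATH}.partial"
mv "${FINAL_PATH}.partial" "${FINAL_PATH}"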
{{- end }}
echo "[backup-cronjob] done"
{{- if ne (.Values.backup.sink | default "pvc") "s3" }}
volumeMounts:
- name: backups
mountPath: /backups
{{- end }}
resources:
{{- toYaml (.Values.backup.resources | default dict) | nindent 16 }}
{{- if ne (.Values.backup.sink | default "pvc") "s3" }}
volumes:
- name: backups
persistentVolumeClaim:
claimName: {{ .Values.backup.pvc.claimName | quote }}
{{- end }}
{{- with .Values.nodeAffinity }}
affinity:
nodeAffinity:
{{- toYaml . | nindent 14 }}
{{- end }}
{{- with .Values.backup.tolerations }}
tolerations:
{{- toYaml . | nindent 12 }}
{{- end }}
{{- end }}
@@ -0,0 +1,89 @@
{{- /*
Phase 4 DEPL-M1 closure (2026-05-14): Helm pre-install / pre-upgrade
hook that runs Postgres migrations before the server Deployment rolls.
Pre-DEPL-M1, postgres.RunMigrations was invoked at server boot
(cmd/server/main.go:151) as the only migration path. That works for
Compose deployments but conflicts with Kubernetes rolling deploys:
when a new server image lands with a schema change, multiple replicas
race the migration during the rollout. The hook resolves the race by
running migrations OUT OF BAND, exactly once, before any new server
pod starts.
How it works:
- The Job ships the same certctl-server image as the Deployment, so
the migration code path is binary-identical to the boot-time path.
- It runs `certctl-server --migrate-only` (a flag the cmd/server
main process must support — see cmd/server/main.go for the flag
parse + early-exit path).
- The CERTCTL_MIGRATIONS_VIA_HOOK=true env var is ALSO set on the
server Deployment (via values.yaml). When the server boots, it
sees this env var and skips its own RunMigrations call — the
hook already did the work. Compose deploys don't set the env
var, so they keep the boot-time path unchanged.
- hook-delete-policy hook-succeeded,before-hook-creation means the
Job is cleaned up automatically on success, retained on failure
for operator diagnosis, and replaced when the next hook run
creates a new Job.
- The negative hook-weight orders this Job ahead of any other
pre-install/pre-upgrade hooks. The Postgres StatefulSet is a
regular (non-hook) resource, so on a first install Helm creates it
only after this Job completes; the hook can therefore reach the
database only when Postgres already exists (an upgrade, or an
external database).
Operators on Compose: this hook is a no-op for you. The server still
runs migrations at boot per the existing path.
*/ -}}
{{- if .Values.migrations.viaHook }}
apiVersion: batch/v1
kind: Job
metadata:
name: {{ include "certctl.fullname" . }}-migrate
labels:
{{- include "certctl.labels" . | nindent 4 }}
app.kubernetes.io/component: migration
annotations:
"helm.sh/hook": pre-install,pre-upgrade
"helm.sh/hook-weight": "-5"
"helm.sh/hook-delete-policy": hook-succeeded,before-hook-creation
spec:
backoffLimit: {{ .Values.migrations.backoffLimit | default 1 }}
activeDeadlineSeconds: {{ .Values.migrations.activeDeadlineSeconds | default 600 }}
template:
metadata:
labels:
{{- include "certctl.labels" . | nindent 8 }}
app.kubernetes.io/component: migration
spec:
restartPolicy: Never
serviceAccountName: {{ include "certctl.serviceAccountName" . }}
securityContext:
{{- include "certctl.podSecurityContext" .Values.server.securityContext | nindent 8 }}
{{- with .Values.imagePullSecrets }}
imagePullSecrets:
{{- toYaml . | nindent 8 }}
{{- end }}
containers:
- name: migrate
image: {{ include "certctl.serverImage" . }}
imagePullPolicy: {{ .Values.server.image.pullPolicy }}
# Migration-only entrypoint. The server binary supports a
# --migrate-only flag that runs postgres.RunMigrations +
# postgres.RunSeed and exits cleanly (zero on success,
# non-zero on migration failure). See cmd/server/main.go
# for the implementation. The flag is hermetic — no HTTP
# listener starts, no scheduler ticks, no signing
# operations occur. Pure schema-mutation pass.
command:
- /app/server
- --migrate-only
env:
- name: CERTCTL_DATABASE_URL
value: {{ include "certctl.databaseURL" . | quote }}
- name: CERTCTL_LOG_LEVEL
value: {{ .Values.server.logging.level | default "info" | quote }}
- name: CERTCTL_LOG_FORMAT
value: {{ .Values.server.logging.format | default "json" | quote }}
resources:
{{- toYaml (.Values.migrations.resources | default .Values.server.resources) | nindent 12 }}
securityContext:
{{- include "certctl.containerSecurityContext" .Values.server.securityContext | nindent 12 }}
{{- end }}
@@ -9,6 +9,21 @@ metadata:
spec:
serviceName: {{ include "certctl.fullname" . }}-postgres
replicas: 1
# Phase 4 DEPL-M4 closure (2026-05-14): explicit StatefulSet update +
# pod-management strategies. Defaults make Postgres upgrades
# operator-controlled rather than automatic:
# updateStrategy.type: OnDelete — Postgres pods do NOT roll
# automatically when the StatefulSet spec changes. Operator
# deletes the pod explicitly after taking a backup + reviewing
# the change. Prevents an accidental Helm-template tweak from
# triggering a database restart at an awkward time.
# podManagementPolicy: OrderedReady — when scaling Postgres to
# more than one replica (future HA work), pods come up one at a time
# and must reach Ready before the next pod is created. Aligns
# with the standard Postgres-on-Kubernetes pattern.
updateStrategy:
type: OnDelete
podManagementPolicy: OrderedReady
selector:
matchLabels:
{{- include "certctl.postgresSelectorLabels" . | nindent 6 }}
@@ -0,0 +1,145 @@
{{- /*
Phase 4 DEPL-L2 closure (2026-05-14): opt-in Prometheus AlertManager
rules covering the four operationally-actionable alerts every certctl
deployment wants out of the box.
OPERATOR OPT-IN. Default `monitoring.prometheusRules.enabled: false`.
Turning it on requires Prometheus Operator CRDs (PrometheusRule kind)
to be installed in-cluster. Without them this template renders an
object Kubernetes will reject — keep the toggle off if you're scraping
with vanilla Prometheus + a Helm-installed AlertManager rules
ConfigMap instead.
Metric names + thresholds verified against the actual
internal/api/handler/metrics.go exposition path:
- certctl_certificate_expiring_soon: server-side count of certs with
ExpiresAt in (now, now + 30d]. The 30-day window is computed in
internal/service/stats.go::GetDashboardSummary.
- certctl_agent_online: agents with heartbeat in the last 5 minutes.
A drop below certctl_agent_total signals offline agents.
- certctl_job_failed_total + certctl_job_completed_total: cumulative
counters; ratio gives the failure rate over the rate() window.
- certctl_issuance_failures_total: cumulative counter of failed
issuance attempts (renewal failures are issuance failures with a
specific error_class label).
Adjust thresholds per fleet — the defaults below are tuned for the
demo dataset (15 certs / 1 agent) and may need raising for production
fleets with thousands of certs where a steady rate of expiring certs
is the normal operating state.
*/ -}}
{{- if and .Values.monitoring.enabled .Values.monitoring.prometheusRules.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: {{ include "certctl.fullname" . }}-rules
labels:
{{- include "certctl.labels" . | nindent 4 }}
app.kubernetes.io/component: monitoring
{{- with .Values.monitoring.prometheusRules.labels }}
{{- toYaml . | nindent 4 }}
{{- end }}
spec:
groups:
- name: certctl.alerts
interval: {{ .Values.monitoring.prometheusRules.interval | default "60s" }}
rules:
# ---------------------------------------------------------------
# Alert: CertctlCertificateExpiringSoon
# Series: certctl_certificate_expiring_soon
# The certctl-server counts certs with ExpiresAt in
# (now, now + 30d] every metrics scrape. Fires whenever any cert
# crosses into that window — operator must triage or extend
# automation coverage. Rapid renewal infrastructure should keep
# this number small in steady state.
# ---------------------------------------------------------------
- alert: CertctlCertificateExpiringSoon
expr: certctl_certificate_expiring_soon > {{ .Values.monitoring.prometheusRules.thresholds.expiringCertificateCount | default 0 }}
for: {{ .Values.monitoring.prometheusRules.thresholds.expiringCertificateFor | default "5m" }}
labels:
severity: warning
component: certctl
annotations:
summary: "certctl: {{`{{ $value }}`}} certificate(s) expiring within 30 days"
description: >-
certctl_certificate_expiring_soon has been > {{ .Values.monitoring.prometheusRules.thresholds.expiringCertificateCount | default 0 }}
for 5+ minutes. Investigate via
/api/v1/certificates?status=expiring or the dashboard's
Expiring tab. If renewal automation should have covered
these, check the renewal scheduler logs for the cert IDs
+ the per-issuer failure rate.
# ---------------------------------------------------------------
# Alert: CertctlAgentOffline
# Series: certctl_agent_total - certctl_agent_online
# Agents flip from online → offline after 5 minutes without a
# heartbeat (internal/service/stats.go::GetDashboardSummary).
# The 1h `for:` window prevents a flapping agent from paging the
# operator on every transient network blip.
# ---------------------------------------------------------------
- alert: CertctlAgentOffline
expr: (certctl_agent_total - certctl_agent_online) > {{ .Values.monitoring.prometheusRules.thresholds.offlineAgentCount | default 0 }}
for: {{ .Values.monitoring.prometheusRules.thresholds.offlineAgentFor | default "1h" }}
labels:
severity: warning
component: certctl-agent
annotations:
summary: "certctl: {{`{{ $value }}`}} agent(s) offline for >1h"
description: >-
One or more certctl-agent instances have been without a
heartbeat for over an hour. Check the agent logs on the
affected hosts. If the agent host is intentionally
decommissioned, retire the agent via the dashboard or
POST /api/v1/agents/{id}/retire to suppress this alert.
# ---------------------------------------------------------------
# Alert: CertctlJobFailureRateHigh
# Series: certctl_job_failed_total / (certctl_job_failed_total + certctl_job_completed_total)
# Computes the failure rate over a 15-minute rate() window so
# short bursts don't fire but a sustained issue does. The 5%
# threshold is a conservative starter — adjust per fleet's
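# baseline. The denominator is a per-second rate, so it is not
# clamped: when no jobs ran in the window the ratio is NaN, which
# never satisfies the comparison, and the alert stays silent.
# ---------------------------------------------------------------
- alert: CertctlJobFailureRateHigh
expr: >-
(
rate(certctl_job_failed_total[15m])
/
(rate(certctl_job_failed_total[15m]) + rate(certctl_job_completed_total[15m]))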
) > {{ .Values.monitoring.prometheusRules.thresholds.jobFailureRate | default 0.05 }}
for: {{ .Values.monitoring.prometheusRules.thresholds.jobFailureRateFor | default "15m" }}
labels:
severity: warning
component: certctl
annotations:
summary: "certctl: job failure rate above 5% over 15m"
description: >-
The 15m rate of certctl_job_failed_total / total jobs
has been above 5% for 15+ minutes. Open
/api/v1/jobs?status=failed to see the failing job IDs
and root-cause the recurring error class.
# ---------------------------------------------------------------
# Alert: CertctlIssuanceFailures
# Series: certctl_issuance_failures_total
# Any non-zero rate of issuance failures over a 15m window is
# operationally significant — a single CA outage or expired
# ACME account can cascade across the fleet.
# ---------------------------------------------------------------
- alert: CertctlIssuanceFailures
expr: rate(certctl_issuance_failures_total[15m]) > {{ .Values.monitoring.prometheusRules.thresholds.issuanceFailureRate | default 0 }}
for: {{ .Values.monitoring.prometheusRules.thresholds.issuanceFailureFor | default "15m" }}
labels:
severity: warning
component: certctl
annotations:
summary: "certctl: certificate issuance / renewal failures over 15m"
description: >-
certctl_issuance_failures_total has been incrementing
over the last 15 minutes. Check the per-issuer breakdown
via /api/v1/issuers + the failed-job log in
/api/v1/jobs?status=failed. Common causes: CA
outage, ACME account rate-limit, EAB credential
expiration, stepca provisioner key rotation without
certctl-side update.
{{- end }}
@@ -31,6 +31,36 @@ server:
port: 8443
# Resource requests and limits
#
# Phase 4 DEPL-M5 (2026-05-14): per-fleet-size tuning ladder. The
# default values below are validated against the demo dataset
# (15 certs / 1 agent) and the baselines in
# docs/operator/performance-baselines.md (single endpoint < 5s for
# 100 sequential requests, i.e. < 50ms mean per request;
# cursor-paginated 1000-cert inventory walk < 3s; renewal scan for
# 15 certs < 100ms).
#
# Larger fleet recommendations (TBD pending Phase 8 load-test runs;
# operators tune empirically until then — capture readings in your
# own loadtest-baselines log):
#
# ≤ 500 certs / 100 agents: defaults below (100m / 128Mi req, 500m / 512Mi lim)
# 5K certs / 1K agents: tune up — TBD Phase 8 (suggested starter: 500m / 512Mi req, 2000m / 2Gi lim)
# 50K certs / 10K agents: tune up — TBD Phase 8 (suggested starter: 2000m / 2Gi req, 4000m / 4Gi lim)
#
# The "suggested starter" values above are operator-tuning starting
# points, NOT validated. Phase 8 (load test coverage expansion) will
# measure them against synthetic fleets and replace the suggestions
# with measured ceilings. Until then, treat them as a "raise CPU
# before raising memory; raise both before scaling out" mental
# model. Per docs/operator/performance-baselines.md, certctl-server
# is CPU-bound on issuance / renewal scan work and memory-bound on
# the inventory query path.
#
# Database scale (postgresql.* below) tracks server scale roughly
# 1:1 — at 50K certs the Postgres instance needs 4 CPU / 4Gi RAM
# and shared_buffers ≥ 1Gi. Postgres tuning is out of scope for
# this comment; see docs/operator/runbooks/postgres-backup.md
# for the production-tuning entry-point.
resources:
requests:
cpu: 100m
@@ -449,6 +479,26 @@ agent:
replicas: 1
# Resource requests and limits
#
# Phase 4 DEPL-M5 (2026-05-14): per-fleet-size tuning ladder for the
# agent. Defaults are sized for the standard "one cert per host"
# operating pattern: the agent polls the server every 60s (default
# CERTCTL_AGENT_POLL_INTERVAL), generates ECDSA P-256 keys locally on
# issuance/renewal events, and is otherwise idle. CPU is bursty only
# during keygen + CSR submission.
#
# Tuning ladder (TBD pending Phase 8 — measure on your fleet):
#
# 1 cert / host (typical): defaults below (50m / 64Mi req, 200m / 256Mi lim)
# 10 certs / host: stays at defaults — agent is poll-driven, not work-bound by cert count
# 100 certs / host (rare): raise lim to 500m / 512Mi if you see throttling on issuance bursts
#
# The agent does NOT cache certs in memory — issuance is one-shot
# generate-then-deploy. So per-host memory scales with whatever
# truststore PEM bundles the agent's connectors load (Apache /
# Postfix / similar), not with the cert count. Defaults are
# appropriate for any "agent terminates ≤ 100 certs on this host"
# deployment.
resources:
requests:
cpu: 50m
@@ -612,6 +662,149 @@ monitoring:
# Optional relabeling for the scrape job.
# relabelings: []
# ----------------------------------------------------------------------
# Phase 4 DEPL-L2 closure (2026-05-14): PrometheusRule (alert rules)
#
# Operator opt-in. Requires Prometheus Operator CRDs (the
# `monitoring.coreos.com/v1` PrometheusRule kind) installed in
# cluster. Without those CRDs the rendered object is rejected by
# `kubectl apply` — keep enabled: false if you scrape with vanilla
# Prometheus + AlertManager rules ConfigMap instead.
#
# Four starter rules ship out of the box (see
# templates/prometheusrules.yaml for the full PromQL):
#
# CertctlCertificateExpiringSoon — certs expiring within 30d
# CertctlAgentOffline — agent without heartbeat for >1h
# CertctlJobFailureRateHigh — job-failure rate over 5% (15m)
# CertctlIssuanceFailures — any issuance failures in last 15m
#
# All thresholds are operator-tunable via the `thresholds:` block
# below. The defaults are tuned for the demo dataset (15 certs / 1
# agent); production fleets with sustained renewal volume MAY want
# to raise the expiringCertificateCount + jobFailureRate thresholds
# to suppress steady-state noise.
prometheusRules:
enabled: false
# Evaluation interval for the rule group.
interval: 60s
# Additional labels applied to the PrometheusRule metadata.
# labels: {}
# Per-alert threshold / duration tunables.
thresholds:
# Fire when more than N certs are in the expiring-soon window.
expiringCertificateCount: 0
expiringCertificateFor: 5m
# Fire when more than N agents are offline (total - online).
offlineAgentCount: 0
offlineAgentFor: 1h
# Fire when job failure rate exceeds this fraction (15m window).
jobFailureRate: 0.05
jobFailureRateFor: 15m
# Fire when issuance failure rate exceeds this value (15m window).
issuanceFailureRate: 0
issuanceFailureFor: 15m
# ==============================================================================
# Backup CronJob (Phase 4 DEPL-H2 closure, 2026-05-14)
# ==============================================================================
# Operator opt-in. Default OFF. The CronJob runs `pg_dump --format=custom
# --no-owner --no-acl --dbname=certctl` matching the canonical shape
# documented in docs/operator/runbooks/postgres-backup.md (so manual
# and automated dumps use the same flags and restore the same way)
# and ships the result to a sink chosen below.
#
# DO NOT enable this for managed Postgres deployments (AWS RDS / GCP
# Cloud SQL / Azure DB) — those have built-in PITR backup that this
# CronJob cannot match. For in-cluster Postgres only.
backup:
enabled: false
# Cron expression (UTC). Default: 02:30 UTC daily.
schedule: "30 2 * * *"
# Sink: "pvc" (default — dump lands on a PersistentVolumeClaim) or
# "s3" (uploads via aws-cli — requires an image that bundles
# aws-cli, see backup.image below).
sink: pvc
# Container image. The default postgres:16-alpine has pg_dump but
# NOT aws-cli; for sink: s3 set this to an image that bundles both
# (e.g. ghcr.io/your-org/postgres-aws:16). The Job's command is
# fixed by the template, so aws-cli must already be in the image.
image: postgres:16-alpine
imagePullPolicy: IfNotPresent
# PVC sink config — used when sink: pvc.
pvc:
# Name of an existing PersistentVolumeClaim mounted at /backups
# in the Job's pod. The PVC's storage class controls durability
# and snapshot retention. Operator creates this PVC out of band
# via their own storage policy.
claimName: certctl-backups
# S3 sink config — used when sink: s3.
s3:
# Target bucket (without s3:// prefix).
bucket: ""
# Object key prefix inside the bucket. Dumps land at
# s3://<bucket>/<prefix>/certctl-<TIMESTAMP>.dump.
prefix: certctl
# AWS region (sets AWS_DEFAULT_REGION). Optional if the image's
# AWS SDK can resolve the region another way (instance profile,
# IRSA, etc.).
region: ""
# Secret holding AWS credentials. The IAM principal needs
# s3:PutObject + s3:ListBucket on the target bucket only.
credentialsSecret:
name: certctl-backup-aws-creds
accessKeyIdKey: AWS_ACCESS_KEY_ID
secretAccessKeyKey: AWS_SECRET_ACCESS_KEY
# Job housekeeping.
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 1
startingDeadlineSeconds: 300
backoffLimit: 1
activeDeadlineSeconds: 3600
# Resource budget for the backup container. pg_dump is generally
# memory-light; ~250MB RSS for fleets up to 100K certs is typical.
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
# Optional tolerations for the backup Job pod.
tolerations: []
# ==============================================================================
# Migrations via Helm hook (Phase 4 DEPL-M1 closure, 2026-05-14)
# ==============================================================================
# When viaHook: true, the chart deploys templates/migration-job.yaml as
# a pre-install + pre-upgrade hook that runs `certctl-server
# --migrate-only` (a hermetic schema-mutation pass) before the server
# Deployment rolls.
#
# Set CERTCTL_MIGRATIONS_VIA_HOOK=true in the server Deployment env to
# tell the server to skip its boot-time RunMigrations call (the hook
# already did the work; running again at boot would race across
# replicas during rollouts).
#
# Default OFF — when off, the server runs migrations at boot exactly
# as it always has (Compose deploys keep this path).
migrations:
viaHook: false
# Job housekeeping.
backoffLimit: 1
activeDeadlineSeconds: 600
# Resource budget for the migration Job pod. The migration pass is
# I/O-bound on Postgres; matches the server's resource budget by
# default. Override here if migrations on a large database need
# more headroom than the steady-state server.
# resources:
# requests:
# cpu: 100m
# memory: 128Mi
# limits:
# cpu: 500m
# memory: 512Mi
# ==============================================================================
# Network Policy (Bundle 3 closure / D11)
# ==============================================================================