Mirror of https://github.com/shankar0123/certctl.git (synced 2026-06-07 12:21:31 +00:00)
deploy(helm): close Phase 4 — chart surface + DR + ops runbooks
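Phase 4 of the certctl architecture diligence remediation closure.
Seven findings. Most of the changes live in deploy/helm/certctl/;
DEPL-M1 also changes cmd/server/main.go, and DEPL-L1 / DEPL-L3 add
runbooks under docs/operator/runbooks/.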
DEPL-H2 (High) — ship deploy/helm/certctl/templates/backup-cronjob.yaml
  Operator opt-in via backup.enabled=true; default OFF. The CronJob runs
    pg_dump --format=custom --no-owner --no-acl --dbname=certctl
  matching the canonical invocation in
  docs/operator/runbooks/postgres-backup.md, so manual and automated
  dumps share one format and flag set (custom-format archives embed a
  creation timestamp, so the files themselves are not byte-identical).
  Sink: PVC (default) or S3 via aws-cli. Documented as in-cluster
  Postgres only; managed-DB deployments rely on their provider's PITR.
DEPL-M1 (Med) — Helm pre-install/pre-upgrade migration hook
  deploy/helm/certctl/templates/migration-job.yaml runs
  `certctl-server --migrate-only` before the server Deployment rolls.
  The --migrate-only flag (new in cmd/server/main.go) is a standalone
  schema-mutation pass: load config, open the DB pool, run
  RunMigrations + RunSeed, exit 0. No HTTP listener, no scheduler, no
  signing setup.
  The server's boot-time RunMigrations and RunSeed calls are now gated
  on CERTCTL_MIGRATIONS_VIA_HOOK. When it is set to true (compared
  case-insensitively), the server skips both because the hook owns
  that work. When it is unset, the server still migrates at boot, so
  Compose / VM / bare-metal deploys are unchanged.
  migrations.viaHook defaults to false in values.yaml.
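The gating in cmd/server/main.go reduces to two small decisions. The sketch below pulls them out as pure functions, so the hook and default lifecycles can be checked without a database. `hasMigrateOnly` and `shouldMigrateAtStartup` are hypothetical names. The comparison logic mirrors the diff: an exact `--migrate-only` match, and `strings.EqualFold(..., "true")` on the env var.

```go
package main

import (
	"fmt"
	"strings"
)

// hasMigrateOnly mirrors the hand-parsed check in cmd/server/main.go:
// any argument (pass os.Args[1:]) exactly equal to "--migrate-only"
// turns the flag on.
func hasMigrateOnly(args []string) bool {
	for _, arg := range args {
		if arg == "--migrate-only" {
			return true
		}
	}
	return false
}

// shouldMigrateAtStartup reproduces the gate around RunMigrations and
// RunSeed: run them when invoked with --migrate-only, or when the hook
// env var is not "true" (compared case-insensitively).
func shouldMigrateAtStartup(migrateOnly bool, viaHookEnv string) bool {
	migrationsViaHook := strings.EqualFold(viaHookEnv, "true")
	return migrateOnly || !migrationsViaHook
}

func main() {
	cases := []struct {
		name string
		args []string
		env  string
	}{
		{"compose/VM default", nil, ""},
		{"helm hook job", []string{"--migrate-only"}, "true"},
		{"helm server pod", nil, "true"},
		{"env set to 1", nil, "1"},
	}
	for _, c := range cases {
		mo := hasMigrateOnly(c.args)
		// exitAfter: --migrate-only exits 0 once migrations + seed finish.
		fmt.Printf("%-20s migrate=%v exitAfter=%v\n", c.name, shouldMigrateAtStartup(mo, c.env), mo)
	}
}
```

The `EqualFold` comparison means only the literal `true` (in any case) makes the server skip migrations. Values like `1` or `yes` leave boot-time migrations on, which fails safe toward the Compose/VM default.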
DEPL-M4 (Med) — explicit Postgres StatefulSet strategy fields
  deploy/helm/certctl/templates/postgres-statefulset.yaml adds:
    spec.updateStrategy.type: OnDelete
    spec.podManagementPolicy: OrderedReady
  OnDelete makes Postgres upgrades operator-controlled. A chart
  template change no longer restarts Postgres immediately; the new pod
  spec applies only when the operator deletes the pod. OrderedReady is
  already the Kubernetes default. Setting it explicitly records intent
  and matches the standard Postgres-on-Kubernetes pattern for any
  future HA work.
DEPL-M5 (Med) — per-fleet-size resource ladder documentation
  deploy/helm/certctl/values.yaml: extended comments next to
  server.resources and agent.resources documenting:
    "≤ 500 certs / 100 agents"  → defaults are validated
    "5K certs / 1K agents"      → starter suggestions, TBD Phase 8
    "50K certs / 10K agents"    → starter suggestions, TBD Phase 8
  Small-fleet numbers derive from the measured baselines in
  docs/operator/performance-baselines.md (50ms p50, < 3s for a
  1000-cert inventory walk, etc.). Larger-fleet numbers are explicitly
  marked TBD pending Phase 8 load-test runs; until then, operators
  tune empirically.
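The following hypothetical helper shows one way to read the ladder: pick the smallest documented tier whose certificate and agent limits both cover the fleet. Two things are assumptions for this sketch: treating each tier's figures as upper bounds, and the tier names `small`/`medium`/`large`. Only the smallest tier carries validated numbers.

```go
package main

import "fmt"

// fleetTier is one rung of the sizing ladder from the values.yaml comments.
type fleetTier struct {
	Name      string // hypothetical label, not a chart value
	MaxCerts  int
	MaxAgents int
	Validated bool // only the smallest tier is validated; others are TBD Phase 8
}

var tiers = []fleetTier{
	{"small", 500, 100, true},
	{"medium", 5_000, 1_000, false},
	{"large", 50_000, 10_000, false},
}

// tierFor returns the first tier whose limits cover both counts, or
// ok=false when the fleet exceeds every documented tier.
func tierFor(certs, agents int) (fleetTier, bool) {
	for _, t := range tiers {
		if certs <= t.MaxCerts && agents <= t.MaxAgents {
			return t, true
		}
	}
	return fleetTier{}, false
}

func main() {
	for _, f := range [][2]int{{300, 50}, {2_000, 150}, {60_000, 5_000}} {
		t, ok := tierFor(f[0], f[1])
		if !ok {
			fmt.Printf("%d certs / %d agents: beyond documented tiers, size empirically\n", f[0], f[1])
			continue
		}
		fmt.Printf("%d certs / %d agents: %s tier (validated=%v)\n", f[0], f[1], t.Name, t.Validated)
	}
}
```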
DEPL-L1 (Low) — Helm rollback runbook
  docs/operator/runbooks/rollback.md covers helm rollback mechanics,
  the manual schema-migration cleanup path (when *.down.sql files
  apply vs. when a full restore is the only safe path), and a
  per-migration-class safe-to-rollback table.
DEPL-L2 (Low) — Prometheus AlertManager rules
  deploy/helm/certctl/templates/prometheusrules.yaml, opt-in via
  monitoring.prometheusRules.enabled=true (default OFF). Four starter
  rules, using metric names verified against
  internal/api/handler/metrics.go:
    CertctlCertificateExpiringSoon  (certctl_certificate_expiring_soon)
    CertctlAgentOffline             ((agent_total - agent_online) > 0 for 1h)
    CertctlJobFailureRateHigh       (job failure rate over 5% for 15m)
    CertctlIssuanceFailures         (any issuance failure in a 15m window)
  All thresholds are operator-tunable via
  monitoring.prometheusRules.thresholds.* in values.yaml.
DEPL-L3 (Low) — Prometheus bearer-token setup runbook
  docs/operator/runbooks/prometheus-bearer-token.md documents the
  API key + Secret + values wiring for the RBAC-gated
  /api/v1/metrics/prometheus scrape endpoint: an end-to-end procedure
  with troubleshooting steps and a rotation guide.
CI guard: scripts/ci-guards/helm-templates-lint.sh
  Six-combination matrix: defaults / backup PVC / backup S3 /
  prometheusRules / migrations.viaHook / all-on. Each combination runs
  helm template and checks that rendering succeeds; helm lint is also
  gated. Wired into the auto-pickup loop in .github/workflows/ci.yml;
  azure/setup-helm@b9e51907 (v4.3.0, SHA-pinned per Phase 1 RED-2)
  installs helm v3.16.0 on the runner.
Verification (all pass):
  ls deploy/helm/certctl/templates/{backup-cronjob,migration-job,prometheusrules}.yaml
  grep -E 'updateStrategy|podManagementPolicy' deploy/helm/certctl/templates/postgres-statefulset.yaml  # 2 matches
  helm template deploy/helm/certctl/ --set backup.enabled=true \
      --set monitoring.prometheusRules.enabled=true --set migrations.viaHook=true \
      | grep -E "kind: (CronJob|PrometheusRule|Job)"  # 3 matches
  helm lint deploy/helm/certctl/  # 0 failed
  ls docs/operator/runbooks/{rollback,prometheus-bearer-token}.md
  bash scripts/ci-guards/helm-templates-lint.sh  # 6/6 matrix combinations pass
  Go build clean (cmd/server compiles with the new --migrate-only
  path). YAML validated.
Closes:
  cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H2
  cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M1
  cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M4
  cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M5
  cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L1
  cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L2
  cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L3
Diff: cmd/server/main.go (+71 / -11)
@@ -55,6 +55,26 @@ import (
 )
 
 func main() {
+	// Phase 4 DEPL-M1 closure (2026-05-14): --migrate-only flag for
+	// the Helm pre-install/pre-upgrade hook (see
+	// deploy/helm/certctl/templates/migration-job.yaml). When set, the
+	// server loads config, opens the DB pool, runs migrations + seed,
+	// and exits — no HTTP listener, no scheduler, no signing work.
+	// Same migration code path as boot-time RunMigrations; only the
+	// surrounding lifecycle differs.
+	//
+	// Hand-parsed (instead of pulling in flag.Parse) because the rest
+	// of the server's config surface is env-var driven via
+	// config.Load(); adding a flag.Parse() with global state risks
+	// conflicting with other binaries that import cmd/server later.
+	migrateOnly := false
+	for _, arg := range os.Args[1:] {
+		if arg == "--migrate-only" {
+			migrateOnly = true
+			break
+		}
+	}
+
 	// Load configuration
 	cfg, err := config.Load()
 	if err != nil {
@@ -146,13 +166,37 @@ func main() {
 	defer db.Close()
 	logger.Info("connected to database")
 
-	// Run migrations
-	logger.Info("running migrations", "path", cfg.Database.MigrationsPath)
-	if err := postgres.RunMigrations(db, cfg.Database.MigrationsPath); err != nil {
-		logger.Error("failed to run migrations", "error", err)
-		os.Exit(1)
+	// Phase 4 DEPL-M1 closure (2026-05-14): migration-via-hook posture.
+	//
+	// Three lifecycles to support:
+	//   (a) Compose / VM / bare-metal: server runs migrations at boot.
+	//       Default behavior — preserved unchanged.
+	//   (b) Helm with pre-install/pre-upgrade hook: the migration Job
+	//       runs `certctl-server --migrate-only`, does its work, and
+	//       exits. The server Deployment's pods then start with
+	//       CERTCTL_MIGRATIONS_VIA_HOOK=true set; they see the env
+	//       var and skip their boot-time RunMigrations call so the
+	//       Job's work isn't duplicated.
+	//   (c) Bare `certctl-server --migrate-only` invocation (e.g.
+	//       operator running a one-shot migration from the CLI):
+	//       runs migrations + seed and exits cleanly. No HTTP
+	//       listener, no scheduler, no signing work.
+	//
+	// migrateOnly captures case (c); CERTCTL_MIGRATIONS_VIA_HOOK
+	// captures case (b). Both paths converge on the same RunMigrations
+	// + RunSeed code below.
+	migrationsViaHook := strings.EqualFold(os.Getenv("CERTCTL_MIGRATIONS_VIA_HOOK"), "true")
+
+	if migrateOnly || !migrationsViaHook {
+		logger.Info("running migrations", "path", cfg.Database.MigrationsPath)
+		if err := postgres.RunMigrations(db, cfg.Database.MigrationsPath); err != nil {
+			logger.Error("failed to run migrations", "error", err)
+			os.Exit(1)
+		}
+		logger.Info("migrations completed")
+	} else {
+		logger.Info("skipping migrations at boot (CERTCTL_MIGRATIONS_VIA_HOOK=true — Helm pre-install/pre-upgrade hook owns this work)")
 	}
-	logger.Info("migrations completed")
 
 	// Apply baseline seed data.
 	//
@@ -166,12 +210,28 @@ func main() {
 	// server runs RunMigrations above, then this RunSeed call lands the
 	// baseline data — all from a single source of truth (this binary).
 	// See internal/repository/postgres/db.go::RunSeed for the contract.
-	logger.Info("applying baseline seed", "path", cfg.Database.MigrationsPath)
-	if err := postgres.RunSeed(db, cfg.Database.MigrationsPath); err != nil {
-		logger.Error("failed to apply seed data", "error", err)
-		os.Exit(1)
+	//
+	// Phase 4 DEPL-M1: same migration-via-hook gating as RunMigrations.
+	// When the hook owns migrations it also owns the seed pass.
+	if migrateOnly || !migrationsViaHook {
+		logger.Info("applying baseline seed", "path", cfg.Database.MigrationsPath)
+		if err := postgres.RunSeed(db, cfg.Database.MigrationsPath); err != nil {
+			logger.Error("failed to apply seed data", "error", err)
+			os.Exit(1)
+		}
+		logger.Info("seed completed")
+	} else {
+		logger.Info("skipping baseline seed at boot (CERTCTL_MIGRATIONS_VIA_HOOK=true — hook applies seed alongside migrations)")
+	}
+
+	// Phase 4 DEPL-M1: --migrate-only early-exit. Migrations + seed are
+	// done; the operator only asked for the migration pass. Skip the
+	// HTTP listener, scheduler, signing setup, banner, etc. Exit 0
+	// cleanly so Kubernetes Job lifecycle reports success.
+	if migrateOnly {
+		logger.Info("--migrate-only: migrations + seed complete; exiting without starting server lifecycle")
+		os.Exit(0)
 	}
-	logger.Info("seed completed")
 
 	// Apply demo overlay seed when CERTCTL_DEMO_SEED=true. Pre-U-3 the demo
 	// overlay (deploy/docker-compose.demo.yml) mounted seed_demo.sql into