mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 20:01:31 +00:00
47da13e7a1
Bundle 3 closure (2026-05-12 acquisition diligence audit). Closes the
"chart claims production-ready but lying-fields silently break it"
hazard cluster: README install command had wrong key, required secrets
weren't fail-fast, external Postgres rendered the bundled StatefulSet
hostname, container-only security hardening fields landed at pod scope
(silently dropped by K8s API), and three advertised template surfaces
(ServiceMonitor, PodDisruptionBudget, NetworkPolicy) didn't render at
all even when their values.yaml toggles were on.
Source findings closed:
C2 C3 D1 D2 D3 D5 D7 D11 D12 (repo audit)
OPS-L1 OPS-L2 (cowork audit)
Source findings explicitly deferred (tracked in WORKSPACE-ROADMAP.md):
D6 OPS-H1 (backup automation — operator must choose target storage)
D10 (digest pinning of latest `:latest` tags)
OPS-M1 (prometheus/client_golang migration)
OPS-M2 (distributed tracing instrumentation)
Chart truth table (rendered with helm 3.16.3):
-f values.yaml + tls.existingSecret + auth.apiKey + pg.auth.password
→ 12 resources (default mode, no monitoring/PDB/networkpolicy)
+ postgresql.enabled=false + externalDatabase.url=…
→ NO StatefulSet, NO postgres-secret, NO postgres-service (D2)
+ server.tls.certManager.enabled=true
→ +1 Certificate (cert-manager mode)
+ replicas=3 + monitoring.enabled=true + serviceMonitor.enabled=true
+ podDisruptionBudget.enabled=true + networkPolicy.enabled=true
→ +1 ServiceMonitor + 1 PodDisruptionBudget + 1 NetworkPolicy (D5+D11)
tls.existingSecret AND tls.certManager.enabled both set
→ REFUSED with "EXACTLY ONE TLS ownership path" error (D7)
Missing required secrets (apiKey / pg password / external URL)
→ REFUSED at template time with operator-actionable guidance (D1)
Closures by source ID:
C2 — README Helm install example fixed. Was `--set postgresql.password=…`
(does not exist); now `--set postgresql.auth.password=…` matching
the chart key. README install block also wires TLS, mentions
fail-fast at template time, and links the external-Postgres example.
C3 — Kubernetes Secrets connector annotated PREVIEW in values.yaml.
The chart still exposes `kubernetesSecrets.enabled` for the RBAC
preview wiring, but the values block now states clearly that the
production K8s client at internal/connector/target/k8ssecret/
k8ssecret.go::realK8sClient is a stub (verified — go.mod imports
zero k8s.io/client-go packages). Production landing tracked in
WORKSPACE-ROADMAP.md.
D1 — `certctl.requiredSecrets` template helper. Fail-fasts at render
time when (a) server.auth.type=api-key + apiKey empty, (b)
postgresql.enabled=true + pg.auth.password empty, (c)
postgresql.enabled=false + externalDatabase.url + legacy env
CERTCTL_DATABASE_URL all empty. Each branch emits an
operator-actionable diagnostic with the openssl rand command or
values override needed. postgres-secret template additionally
uses Helm's `required` builtin so it can't render with the empty
fallback that pre-Bundle-3 produced ("changeme" literal).
D2 — externalDatabase.url first-class. New top-level values block.
certctl.databaseURL helper now branches on postgresql.enabled:
bundled path uses the helper-emitted in-cluster URL; external
path uses externalDatabase.url verbatim. postgres-secret,
postgres-statefulset, and postgres-service ALL gate on
postgresql.enabled — external mode renders ZERO postgres-*
resources. POSTGRES_PASSWORD env in server-deployment also gates.
D3 — Container-vs-pod security context split. K8s API silently drops
readOnlyRootFilesystem / allowPrivilegeEscalation / capabilities /
privileged when they land at pod scope (`spec.securityContext`);
they only work at container scope (`spec.containers[].securityContext`).
Pre-Bundle-3 all fields sat at pod scope so the chart's documented
"read-only rootfs + drop-all caps" hardening was effectively
unenforced. New certctl.podSecurityContext + containerSecurityContext
helpers split the operator-facing securityContext map by field-name
whitelist so existing values keep working byte-for-byte while
fields render at the K8s-valid scope. Applied to both
server-deployment.yaml and agent-daemonset.yaml (DaemonSet + Deployment
branches).
D5 — Prometheus ServiceMonitor template. New
templates/servicemonitor.yaml. Renders when monitoring.enabled AND
monitoring.serviceMonitor.enabled. Scrapes /api/v1/metrics/prometheus
(rbac-gated on metrics.read — needs bearerTokenSecret with an API
key holding that perm). values.yaml block extended with bearerTokenSecret,
tlsConfig, and relabelings knobs and the operator-facing comment
documenting the auth requirement.
D7 — TLS both-set rejection. certctl.tls.required helper extended.
Pre-Bundle-3 only the NEITHER-set case was caught; setting BOTH
rendered a dangling cert-manager Certificate alongside an
existing-Secret mount, two conflicting TLS sources of truth.
Now refuses with "EXACTLY ONE TLS ownership path" + remediation
steps for both possible operator intents.
D11 — PodDisruptionBudget + NetworkPolicy templates. New
templates/pdb.yaml (renders when podDisruptionBudget.enabled +
server.replicas > 1) + templates/networkpolicy.yaml (renders when
networkPolicy.enabled). PDB uses minAvailable / maxUnavailable
exclusivity per K8s spec. NetworkPolicy default-allows in-namespace
agent → server traffic, kube-DNS egress, and bundled-postgres
egress (when postgresql.enabled), with operator-extensible
extraIngress / extraEgress for CA / OIDC / SMTP egress. Both
default off so existing deploys don't lose network reach
unannounced.
D12 — Database max-conn config wired. Pre-Bundle-3
internal/repository/postgres/db.go::NewDB hard-coded
SetMaxOpenConns(25). config.go loaded CERTCTL_DATABASE_MAX_CONNS,
Validate() enforced the >= 1 floor, values.yaml documented it,
and docs/reference/configuration.md surfaced it — but the pool
ignored every operator setting. New NewDBWithMaxConns threads
the operator value into the pool with maxIdle = maxOpen / 5
(≥ 1) so the historical ratio carries forward. cmd/server/main.go
calls the new constructor; NewDB stays for compat at the default 25.
OPS-L1 — Chart version 0.1.0 → 1.0.0. Chart has shipped through 8 audit
closures since 2026-02 (M-018, U-1, U-2, U-3, H-1, G-1, B1, B2);
pre-1.0 version was implying instability the chart no longer has.
OPS-L2 — External-Postgres path is now properly documented in values.yaml
(externalDatabase block with mode-2 example), README install command
links the existing examples/values-external-db.yaml, and the chart
truth table above proves the external mode renders cleanly.
Receipts:
helm lint deploy/helm/certctl/ # clean
helm template c deploy/helm/certctl/ \
--set server.tls.existingSecret=ci \
--set postgresql.auth.password=p \
--set server.auth.apiKey=k # 12 kinds, default
helm template c deploy/helm/certctl/ \
--set server.tls.existingSecret=ci \
--set postgresql.enabled=false \
--set externalDatabase.url='postgres://u:p@h:5432/db?sslmode=require' \
--set server.auth.apiKey=k # 9 kinds, no postgres-*
helm template c deploy/helm/certctl/ \
--set server.tls.certManager.enabled=true \
--set server.tls.certManager.issuerRef.name=letsencrypt \
--set postgresql.auth.password=p --set server.auth.apiKey=k
# +1 Certificate (cert-manager)
helm template c deploy/helm/certctl/ \
--set server.tls.existingSecret=ci \
--set postgresql.auth.password=p --set server.auth.apiKey=k \
--set server.replicas=3 \
--set monitoring.enabled=true \
--set monitoring.serviceMonitor.enabled=true \
--set podDisruptionBudget.enabled=true \
--set networkPolicy.enabled=true # +ServiceMonitor +PDB +NetworkPolicy
(TLS both-set + missing apiKey + missing pg password + missing extDb URL all REFUSED.)
gofmt -l # clean
go vet ./internal/repository/postgres ./cmd/server # clean
go build ./cmd/server # clean
bash scripts/ci-guards/B3-helm-chart-coherence.sh # clean
Remaining operator warnings (deferred, tracked in WORKSPACE-ROADMAP.md):
- Backup CronJob + restore script (D6 + OPS-H1): operator chooses
target (S3, GCS, Azure Blob, NFS). Sample CronJob yaml may ship
in deploy/helm/examples/ once an operator workstation has run
one full backup-restore cycle.
- Distributed tracing (OPS-M2): otel/* are go.mod indirect deps,
not actively instrumented. Adding spans is a v3 work item.
- Prometheus client_golang migration (OPS-M1): the hand-rolled
/metrics/prometheus exposition format works today; client_golang
migration unlocks histograms + exemplars + native label sets.
Audit-Closes: BUNDLE-3 C2 C3 D1 D2 D3 D5 D7 D11 D12 OPS-L1 OPS-L2
Audit-Defers: D6 D10 OPS-H1 OPS-M1 OPS-M2
279 lines
12 KiB
Go
279 lines
12 KiB
Go
package postgres
|
|
|
|
import (
|
|
"database/sql"
|
|
"errors"
|
|
"fmt"
|
|
"os"
|
|
"path/filepath"
|
|
"strings"
|
|
|
|
"github.com/lib/pq"
|
|
)
|
|
|
|
// pgErrInvalidPassword is the SQLSTATE for class 28 / code 28P01 —
|
|
// invalid_password — emitted by PostgreSQL when the client presents
|
|
// credentials that don't match pg_authid. Defined locally because the
|
|
// lib/pq package does not export named constants for SQLSTATE codes (it
|
|
// only exposes the typed string alias pq.ErrorCode and a name-lookup map
|
|
// at runtime). Pinned as a string constant rather than a pq.ErrorCode
|
|
// literal so the contract is grep-able from operator-facing log lines.
|
|
//
|
|
// Reference: https://www.postgresql.org/docs/16/errcodes-appendix.html
|
|
const pgErrInvalidPassword = "28P01"
|
|
|
|
// NewDB opens a PostgreSQL database connection and sets up connection pooling.
|
|
//
|
|
// The pool size is hard-coded here for backward compatibility with call
|
|
// sites that don't pass an explicit limit. The Bundle 3 closure (D12)
|
|
// added NewDBWithMaxConns so the operator-facing CERTCTL_DATABASE_MAX_CONNS
|
|
// env var actually flows through to db.SetMaxOpenConns; cmd/server/main.go
|
|
// uses NewDBWithMaxConns now. Idle conns are kept at MaxOpenConns/5
|
|
// (rounded up to at least 1) so the ratio behavior pre-Bundle-3 (5 idle
|
|
// out of 25 max) generalizes to any operator-supplied limit.
|
|
func NewDB(connStr string) (*sql.DB, error) {
|
|
return NewDBWithMaxConns(connStr, 25)
|
|
}
|
|
|
|
// NewDBWithMaxConns opens a PostgreSQL database connection and sets the
|
|
// connection-pool max-open + max-idle limits.
|
|
//
|
|
// Bundle 3 closure (D12): pre-Bundle-3 the pool was hard-coded to
|
|
// SetMaxOpenConns(25) regardless of the operator's
|
|
// CERTCTL_DATABASE_MAX_CONNS setting. The config field was loaded by
|
|
// internal/config/config.go (validated to be >= 1), wired into values.yaml's
|
|
// `CERTCTL_DATABASE_MAX_CONNS: "25"` example comment, AND surfaced in
|
|
// docs/reference/configuration.md as if it took effect — but the
|
|
// underlying pool ignored it. Operators tuning for higher load found
|
|
// no behavioral change.
|
|
//
|
|
// Post-Bundle-3 the operator-facing knob actually flows here. The
|
|
// idle-pool size scales with maxOpen so the historical 1:5 ratio
|
|
// (5 idle of 25 max) carries forward to any operator-supplied size.
|
|
// Floors: maxOpen guaranteed >= 1 by config.Validate; maxIdle floored
|
|
// to 1 to avoid the 0-idle pathological case where every query opens
|
|
// a fresh connection.
|
|
func NewDBWithMaxConns(connStr string, maxOpen int) (*sql.DB, error) {
|
|
if maxOpen < 1 {
|
|
maxOpen = 1
|
|
}
|
|
db, err := sql.Open("postgres", connStr)
|
|
if err != nil {
|
|
return nil, fmt.Errorf("failed to open database: %w", err)
|
|
}
|
|
|
|
// Configure connection pool.
|
|
db.SetMaxOpenConns(maxOpen)
|
|
maxIdle := maxOpen / 5
|
|
if maxIdle < 1 {
|
|
maxIdle = 1
|
|
}
|
|
db.SetMaxIdleConns(maxIdle)
|
|
|
|
// Ping to verify connection.
|
|
if err := db.Ping(); err != nil {
|
|
return nil, wrapPingError(err)
|
|
}
|
|
|
|
return db, nil
|
|
}
|
|
|
|
// wrapPingError converts a db.Ping() failure into an operator-friendly
|
|
// diagnostic. The default wrap is the original opaque
|
|
// `"failed to ping database: <inner>"` shape. The exception is SQLSTATE 28P01
|
|
// (invalid_password): when postgres rejects the server's credentials we emit
|
|
// extended guidance that names the most common operator misstep — editing
|
|
// POSTGRES_PASSWORD in `.env` after the postgres named volume has already
|
|
// been initialized — and lists both the destructive (`docker compose down -v`)
|
|
// and non-destructive (`ALTER ROLE`) remediations.
|
|
//
|
|
// U-1 (P1, GitHub #10): closes the audit-flagged
|
|
// cat-u-quickstart_postgres_password_volume_trap finding. The postgres
|
|
// docker-entrypoint runs initdb only when /var/lib/postgresql/data is empty;
|
|
// on subsequent boots the password baked into pg_authid on first boot wins
|
|
// over whatever the env var carries, so the env-vs-pg_authid divergence is
|
|
// intrinsic to how the postgres image bootstraps and cannot be fixed by us
|
|
// upstream of pg_authid. The ergonomic answer is to surface a clear
|
|
// diagnostic at the failure site so operators don't waste an hour on
|
|
// "is my password right" before discovering the volume needs to be torn
|
|
// down (or the role's password rotated in-place).
|
|
//
|
|
// The wrap chain is preserved via fmt.Errorf("%w", err) so callers using
|
|
// errors.As(err, &*pq.Error) on the returned value continue to work; this
|
|
// matches the audit's "no substring matching on err.Error()" requirement
|
|
// from the M-1 sentinel-error migration.
|
|
//
|
|
// Returns nil when err is nil so callers can defensively pipe through this
|
|
// helper without an extra branch.
|
|
func wrapPingError(err error) error {
|
|
if err == nil {
|
|
return nil
|
|
}
|
|
|
|
var pqErr *pq.Error
|
|
if errors.As(err, &pqErr) && string(pqErr.Code) == pgErrInvalidPassword {
|
|
return fmt.Errorf(
|
|
"failed to ping database: postgres rejected the configured credentials "+
|
|
"(SQLSTATE %s — invalid_password). If you recently rotated POSTGRES_PASSWORD "+
|
|
"on a docker-compose deploy, the postgres container's data volume still "+
|
|
"holds the previous password: initdb seeds POSTGRES_PASSWORD into pg_authid "+
|
|
"only on first boot of a fresh data dir, so editing the env var after that "+
|
|
"point updates only the certctl-server container. Reset destructively with "+
|
|
"`docker compose -f deploy/docker-compose.yml down -v && "+
|
|
"docker compose -f deploy/docker-compose.yml up -d --build` (this DESTROYS "+
|
|
"all data in the postgres volume), or non-destructively with "+
|
|
"`docker compose -f deploy/docker-compose.yml exec postgres "+
|
|
"psql -U certctl -c \"ALTER ROLE certctl PASSWORD '<new-password>';\"` "+
|
|
"and then redeploy with the matching POSTGRES_PASSWORD. Underlying error: %w",
|
|
pgErrInvalidPassword, err)
|
|
}
|
|
|
|
return fmt.Errorf("failed to ping database: %w", err)
|
|
}
|
|
|
|
// RunMigrations reads and executes SQL migration files from a directory.
|
|
func RunMigrations(db *sql.DB, migrationsPath string) error {
|
|
// Check if migrations directory exists
|
|
if _, err := os.Stat(migrationsPath); os.IsNotExist(err) {
|
|
return fmt.Errorf("migrations directory not found: %s", migrationsPath)
|
|
}
|
|
|
|
// Read all SQL files from the migrations directory
|
|
files, err := os.ReadDir(migrationsPath)
|
|
if err != nil {
|
|
return fmt.Errorf("failed to read migrations directory: %w", err)
|
|
}
|
|
|
|
// Sort and filter to only .up.sql migration files (skip .down.sql rollbacks and seed files)
|
|
var sqlFiles []string
|
|
for _, file := range files {
|
|
if !file.IsDir() && strings.HasSuffix(file.Name(), ".up.sql") {
|
|
sqlFiles = append(sqlFiles, file.Name())
|
|
}
|
|
}
|
|
|
|
// Execute each migration file in order
|
|
for _, filename := range sqlFiles {
|
|
filePath := filepath.Join(migrationsPath, filename)
|
|
content, err := os.ReadFile(filePath)
|
|
if err != nil {
|
|
return fmt.Errorf("failed to read migration file %s: %w", filename, err)
|
|
}
|
|
|
|
// Execute the SQL content
|
|
if _, err := db.Exec(string(content)); err != nil {
|
|
return fmt.Errorf("failed to execute migration %s: %w", filename, err)
|
|
}
|
|
}
|
|
|
|
return nil
|
|
}
|
|
|
|
// RunSeed reads and executes the baseline seed SQL file from the migrations
|
|
// directory. Designed to run AFTER RunMigrations so every column referenced by
|
|
// the seed is already in place.
|
|
//
|
|
// U-3 (P1, cat-u-seed_initdb_schema_drift): pre-U-3 the deploy compose stack
|
|
// mounted both a hand-curated subset of `migrations/*.up.sql` and `seed.sql`
|
|
// into postgres `/docker-entrypoint-initdb.d/`. Postgres applied them at
|
|
// initdb time. When `seed.sql` was updated to reference columns added by
|
|
// migrations *after* the mounted cutoff (e.g., `policy_rules.severity` from
|
|
// `000013_policy_rule_severity.up.sql`), initdb crashed during the seed step
|
|
// and the container was reported `unhealthy` indefinitely — bare
|
|
// `docker compose -f deploy/docker-compose.yml up -d --build` from a fresh
|
|
// clone of v2.0.50 hit this on the first try (GitHub #10 reopened by
|
|
// mikeakasully). Helm and the example compose files were already runtime-
|
|
// only (Path B) and worked through the same window.
|
|
//
|
|
// Post-U-3 the compose stack drops all initdb mounts; postgres comes up with
|
|
// an empty schema; the server applies all migrations via RunMigrations and
|
|
// then this function applies the seed. Single source of truth, removes the
|
|
// drift hazard architecturally.
|
|
//
|
|
// The seed file is expected at `<migrationsPath>/seed.sql`. Missing-file is
|
|
// treated as a no-op (returns nil) so deployments that explicitly remove the
|
|
// seed (custom packaging, cert-manager managed schemas) don't break.
|
|
//
|
|
// Idempotency: every INSERT in the shipped seed.sql uses
|
|
// `ON CONFLICT (id) DO NOTHING`, so re-running on a populated DB is safe.
|
|
// This function is invoked on every server start, so the contract MUST hold.
|
|
//
|
|
// Demo seed: `seed_demo.sql` is applied separately by RunDemoSeed below
|
|
// when CERTCTL_DEMO_SEED=true (see internal/config/config.go::DemoSeed).
|
|
// Splitting demo from baseline keeps a default deploy from accidentally
|
|
// landing 90-days-of-fake-history into a real customer database, while
|
|
// still giving the demo overlay a single source of truth (no more initdb
|
|
// mounts). The demo seed itself uses ON CONFLICT (id) DO NOTHING so it's
|
|
// idempotent; missing-file is also tolerated (custom packaging may strip
|
|
// seed_demo.sql to shrink the image).
|
|
func RunSeed(db *sql.DB, migrationsPath string) error {
|
|
if _, err := os.Stat(migrationsPath); os.IsNotExist(err) {
|
|
return fmt.Errorf("migrations directory not found: %s", migrationsPath)
|
|
}
|
|
|
|
seedPath := filepath.Join(migrationsPath, "seed.sql")
|
|
content, err := os.ReadFile(seedPath)
|
|
if err != nil {
|
|
if os.IsNotExist(err) {
|
|
// Missing seed.sql is acceptable — operators may have removed it
|
|
// for custom-packaging reasons. Return nil rather than fail-loud.
|
|
return nil
|
|
}
|
|
return fmt.Errorf("failed to read seed file %s: %w", seedPath, err)
|
|
}
|
|
|
|
if _, err := db.Exec(string(content)); err != nil {
|
|
return fmt.Errorf("failed to execute seed file %s: %w", seedPath, err)
|
|
}
|
|
|
|
return nil
|
|
}
|
|
|
|
// RunDemoSeed applies the demo overlay seed file
|
|
// (`<migrationsPath>/seed_demo.sql`) on top of the baseline seed.
|
|
//
|
|
// U-3 follow-on: pre-U-3 the demo overlay mounted `seed_demo.sql` into
|
|
// postgres `/docker-entrypoint-initdb.d/` and relied on initdb to apply it
|
|
// alongside the schema. Once U-3 dropped the initdb migration mounts, that
|
|
// path stopped working — postgres comes up empty, and the demo seed
|
|
// references tables (issuers, certificates, etc.) that wouldn't exist yet
|
|
// at initdb time. RunDemoSeed restores the demo capability through the
|
|
// same runtime path RunSeed uses, gated by CERTCTL_DEMO_SEED so production
|
|
// deploys never accidentally land the fake-history rows.
|
|
//
|
|
// Order contract: must run AFTER RunSeed so foreign-key references from
|
|
// demo rows to baseline rows (e.g., demo certificates referencing
|
|
// `rp-default` from baseline) resolve cleanly. The caller in
|
|
// cmd/server/main.go enforces this order.
|
|
//
|
|
// Missing-file is acceptable (returns nil) — operators packaging a
|
|
// production-only image often strip seed_demo.sql to shrink the artifact,
|
|
// and that should not break boot when CERTCTL_DEMO_SEED happens to be set.
|
|
//
|
|
// Idempotency: every INSERT in seed_demo.sql uses
|
|
// `ON CONFLICT (id) DO NOTHING`, so re-running on a populated DB is safe.
|
|
// Server restarts in demo mode therefore re-apply the file harmlessly.
|
|
func RunDemoSeed(db *sql.DB, migrationsPath string) error {
|
|
if _, err := os.Stat(migrationsPath); os.IsNotExist(err) {
|
|
return fmt.Errorf("migrations directory not found: %s", migrationsPath)
|
|
}
|
|
|
|
seedPath := filepath.Join(migrationsPath, "seed_demo.sql")
|
|
content, err := os.ReadFile(seedPath)
|
|
if err != nil {
|
|
if os.IsNotExist(err) {
|
|
// Custom production packaging frequently strips this file.
|
|
// Fail-soft to preserve the U-3 contract: a missing seed file
|
|
// must not gate server boot.
|
|
return nil
|
|
}
|
|
return fmt.Errorf("failed to read demo seed file %s: %w", seedPath, err)
|
|
}
|
|
|
|
if _, err := db.Exec(string(content)); err != nil {
|
|
return fmt.Errorf("failed to execute demo seed file %s: %w", seedPath, err)
|
|
}
|
|
|
|
return nil
|
|
}
|