fix(deploy,db,handler): close fresh-clone postgres init failure + 4 ride-along audit findings (U-3 master)

GitHub #10 reopened: operator mikeakasully cloned v2.0.50 fresh and ran the
canonical quickstart (docker compose -f deploy/docker-compose.yml up -d --build);
postgres reported unhealthy indefinitely, dependent containers never started.

Root cause: deploy/docker-compose.yml mounted a hand-curated subset of
migrations/*.up.sql + seed.sql into postgres /docker-entrypoint-initdb.d/.
Postgres applied them at initdb time. Once seed.sql referenced columns added
by migrations *after* the mounted cutoff (e.g., policy_rules.severity from
migration 000013), initdb crashed mid-seed and the container loop wedged.
Two sources of truth (compose mount list vs in-tree migration ladder)
diverged the moment a seed-touching migration shipped, and the only thing
that fixed it was hand-editing the compose file every release.

Fix: remove the dual source. Postgres boots empty; the server applies
migrations + seed at startup via RunMigrations + RunSeed. Helm has used
this pattern since day one (postgres-init emptyDir); compose now matches.
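
A minimal sketch of the boot order described above, assuming hypothetical
stand-ins (`bootstrap`, `step`) for the real RunMigrations / RunSeed /
RunDemoSeed calls. It is illustrative only and is not the code in
cmd/server/main.go:

```go
package main

import (
	"errors"
	"fmt"
)

// step is a hypothetical stand-in for one startup action
// (RunMigrations, RunSeed or RunDemoSeed in the real server).
type step struct {
	name string
	run  func() error
}

// bootstrap runs migrations, then the baseline seed, then (only when
// demoSeed is true) the demo seed. It stops at the first failure and
// reports which steps completed.
func bootstrap(demoSeed bool, migrate, seed, demo func() error) ([]string, error) {
	steps := []step{{"migrations", migrate}, {"seed", seed}}
	if demoSeed {
		steps = append(steps, step{"demo seed", demo})
	}
	var done []string
	for _, s := range steps {
		if err := s.run(); err != nil {
			return done, fmt.Errorf("%s: %w", s.name, err)
		}
		done = append(done, s.name)
	}
	return done, nil
}

func main() {
	ok := func() error { return nil }

	done, err := bootstrap(false, ok, ok, ok)
	fmt.Println(done, err)

	// A failing seed aborts boot before the demo seed is attempted.
	done, err = bootstrap(true, ok, func() error { return errors.New("boom") }, ok)
	fmt.Println(done, err)
}
```

The point of the ordering is that the seed only ever runs against a schema
that already has every migration applied, which is what removes the drift
hazard.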

Bundled with four ride-along audit findings whose fixes share the same
schema/db code surface, so operators take the schema-change pain only once:

  cat-u-seed_initdb_schema_drift           [P1, primary] — initdb-mount fix
  cat-o-retry_interval_unit_mismatch       [P1] — column rename minutes→seconds
  cat-o-notification_created_at_dead_field [P2] — add column + populate
  cat-o-health_check_column_orphans        [P1] — drop unwired columns
  cat-u-no_version_endpoint                [P2] — add /api/v1/version

Single migration (000017_db_coupling_cleanup) bundles the three schema
changes under a DO $$ guard so re-application is safe; reduces
operator-visible 'schema-change releases' from four to one.
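
For readers unfamiliar with the guard pattern, here is a hedged sketch of
what an idempotent column rename could look like. The helper name
`guardedRename` and the exact SQL are assumptions for illustration; the
shipped 000017 migration may be written differently (for example, it may
also scope by schema or convert stored values):

```go
package main

import "fmt"

// guardedRename builds a PL/pgSQL DO block that renames a column only
// when the old name still exists, so running it twice is harmless.
// Hypothetical helper: identifiers are interpolated without escaping,
// so it is only suitable for trusted, hard-coded names.
func guardedRename(table, oldCol, newCol string) string {
	return fmt.Sprintf(`DO $$
BEGIN
    IF EXISTS (
        SELECT 1 FROM information_schema.columns
        WHERE table_name = '%s' AND column_name = '%s'
    ) THEN
        ALTER TABLE %s RENAME COLUMN %s TO %s;
    END IF;
END
$$;`, table, oldCol, table, oldCol, newCol)
}

func main() {
	fmt.Println(guardedRename(
		"renewal_policies",
		"retry_interval_minutes",
		"retry_interval_seconds",
	))
}
```

On a second run the IF EXISTS check finds no old column and the block does
nothing, which is the property "re-application is safe" refers to.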

Backend
- internal/repository/postgres/db.go: add RunSeed (baseline) + RunDemoSeed
  (gated by CERTCTL_DEMO_SEED). Both idempotent (ON CONFLICT DO NOTHING in
  every shipped INSERT) so repeated boots are safe; missing-file is no-op
  so custom packaging that strips seeds still boots cleanly.
- cmd/server/main.go: invoke RunSeed (always) + RunDemoSeed (when flag set)
  immediately after RunMigrations.
- internal/repository/postgres/notification.go: NotificationRepository.Create
  now sets created_at (with time.Now() fallback when caller leaves it zero);
  scanNotification reads it back; List + ListRetryEligible SELECT extended.
- internal/repository/postgres/renewal_policy.go: column references updated
  to retry_interval_seconds across SELECT/INSERT/UPDATE sites.
- internal/api/handler/version.go: new VersionHandler exposes
  {version, commit, modified, build_time, go_version} from
  runtime/debug.ReadBuildInfo() with ldflags-supplied Version override.
- internal/api/router/router.go: register GET /api/v1/version through the
  no-auth chain (CORS + ContentType) alongside /health, /ready,
  /api/v1/auth/info.
- cmd/server/main.go: add /api/v1/version to no-auth dispatch + audit
  ExcludePaths so rollout polling doesn't dominate the audit trail.
- internal/config/config.go: add DatabaseConfig.DemoSeed +
  CERTCTL_DEMO_SEED env var.
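
The Version fallback used by the new handler (ldflags value, then VCS
commit, then "dev") can be sketched in isolation. `resolveVersion` is a
hypothetical name for illustration; the real logic lives inline in
readBuildInfo:

```go
package main

import "fmt"

// resolveVersion picks the first non-empty identifier in priority
// order: the ldflags-supplied version, then the VCS commit, then the
// static "dev" fallback.
func resolveVersion(ldflagsVersion, vcsCommit string) string {
	switch {
	case ldflagsVersion != "":
		return ldflagsVersion
	case vcsCommit != "":
		return vcsCommit
	default:
		return "dev"
	}
}

func main() {
	fmt.Println(resolveVersion("v2.0.50", "a3d8b9c607")) // release build
	fmt.Println(resolveVersion("", "a3d8b9c607"))        // plain go build in a checkout
	fmt.Println(resolveVersion("", ""))                  // no ldflags, no VCS info
}
```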

Migration
- migrations/000017_db_coupling_cleanup.up.sql + .down.sql:
    (1) renewal_policies.retry_interval_minutes → retry_interval_seconds
        (DO $$ guard, idempotent re-application)
    (2) notification_events ADD COLUMN created_at TIMESTAMPTZ
        NOT NULL DEFAULT NOW()
    (3) network_scan_targets DROP orphan health_check_enabled +
        health_check_interval_seconds
- migrations/seed.sql: column reference updated to retry_interval_seconds.
- migrations/seed_demo.sql: same column rename + applied at runtime now via
  RunDemoSeed (no longer initdb-mounted).
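
A sketch of the missing-file contract that both seed loaders follow,
assuming a hypothetical helper `readOptionalSQL`. It covers only the file
read; the real RunSeed / RunDemoSeed additionally fail when the migrations
directory itself is absent and then execute the SQL:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// readOptionalSQL returns the file contents and found=true when the
// file exists. A missing file is not an error: it yields found=false
// with a nil error so the caller can skip the step. Any other read
// failure is returned wrapped.
func readOptionalSQL(dir, name string) (sql string, found bool, err error) {
	content, err := os.ReadFile(filepath.Join(dir, name))
	if errors.Is(err, fs.ErrNotExist) {
		return "", false, nil
	}
	if err != nil {
		return "", false, fmt.Errorf("read %s: %w", name, err)
	}
	return string(content), true, nil
}

func main() {
	dir, err := os.MkdirTemp("", "seed-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// No seed_demo.sql in the directory: the step is skipped, not failed.
	_, found, err := readOptionalSQL(dir, "seed_demo.sql")
	fmt.Println(found, err)

	// With the file present its contents come back for execution.
	path := filepath.Join(dir, "seed_demo.sql")
	if err := os.WriteFile(path, []byte("SELECT 1;"), 0o600); err != nil {
		panic(err)
	}
	sql, found, err := readOptionalSQL(dir, "seed_demo.sql")
	fmt.Println(sql, found, err)
}
```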

Compose
- deploy/docker-compose.yml: drop ALL initdb mounts (10 migration files +
  seed.sql); add start_period: 30s to postgres + certctl-server healthchecks
  to absorb the runtime migration + seed application window on first boot.
- deploy/docker-compose.test.yml: same drop (+ ghost seed_test.sql mount
  removed; that file never existed); same healthcheck start_period.
- deploy/docker-compose.demo.yml: replace seed_demo.sql initdb mount with
  CERTCTL_DEMO_SEED=true env var on certctl-server.

Tests
- internal/api/handler/version_handler_test.go: TestVersion_ReturnsBuildInfo,
  TestVersion_RejectsNonGet, TestVersion_LdflagsOverride.
- internal/repository/postgres/seed_test.go: TestRunSeed_AppliesIdempotently,
  TestRunSeed_MissingFileIsNoOp, TestRunDemoSeed_AppliesIdempotently,
  TestMigration000017_RetryIntervalRename,
  TestMigration000017_NotificationCreatedAt,
  TestMigration000017_HealthCheckOrphansDropped (testcontainers, -short skips).
- internal/repository/postgres/notification_test.go:
  TestNotificationRepository_CreatedAt_IsPersisted +
  TestNotificationRepository_CreatedAt_DefaultsToNow.
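
The behavior the two CreatedAt tests pin down can be shown without a
database. This sketch uses a hypothetical `event` type and injectable
clock in place of domain.NotificationEvent and time.Now; the real Create
calls time.Now() directly:

```go
package main

import (
	"fmt"
	"time"
)

// event is a hypothetical stand-in for domain.NotificationEvent.
type event struct {
	CreatedAt time.Time
}

// stampCreatedAt fills CreatedAt from the supplied clock only when the
// caller left it at the zero value; a caller-provided timestamp is kept.
func stampCreatedAt(e *event, now func() time.Time) {
	if e.CreatedAt.IsZero() {
		e.CreatedAt = now()
	}
}

// fixedClock and laterClock are deterministic clocks for the example.
func fixedClock() time.Time { return time.Date(2026, 4, 25, 13, 29, 23, 0, time.UTC) }
func laterClock() time.Time { return fixedClock().Add(time.Hour) }

func main() {
	unset := &event{}
	stampCreatedAt(unset, fixedClock)
	fmt.Println(unset.CreatedAt.Equal(fixedClock()))

	preset := &event{CreatedAt: fixedClock()}
	stampCreatedAt(preset, laterClock)
	fmt.Println(preset.CreatedAt.Equal(fixedClock()))
}
```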

CI guardrail
- .github/workflows/ci.yml: new 'Forbidden migration mount in compose initdb
  (U-3)' step grep-fails the build if any migrations/*.sql or seed*.sql
  re-appears in /docker-entrypoint-initdb.d in any compose file. Catches
  future drift before a fresh-clone operator hits it.
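
The CI step itself is a grep; the same check is sketched here in Go purely
to show what it looks for. The function name, the regular expression, and
the choice to skip comment lines are assumptions, not a transcription of
the workflow step:

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// sqlSource matches a path that looks like a migration or seed file.
var sqlSource = regexp.MustCompile(`migrations/\S*\.sql|seed\S*\.sql`)

// findForbiddenMounts returns every non-comment line of a compose file
// that maps a migration or seed SQL file into the postgres initdb
// directory.
func findForbiddenMounts(compose string) []string {
	var hits []string
	sc := bufio.NewScanner(strings.NewReader(compose))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.HasPrefix(line, "#") {
			continue
		}
		if strings.Contains(line, "/docker-entrypoint-initdb.d") && sqlSource.MatchString(line) {
			hits = append(hits, line)
		}
	}
	return hits
}

func main() {
	bad := "volumes:\n" +
		"  - ../migrations/000001_init.up.sql:/docker-entrypoint-initdb.d/001.sql\n" +
		"  - ../migrations/seed.sql:/docker-entrypoint-initdb.d/010.sql\n"
	good := "volumes:\n  - pgdata:/var/lib/postgresql/data\n"

	fmt.Println(len(findForbiddenMounts(bad)))
	fmt.Println(len(findForbiddenMounts(good)))
}
```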

Spec / Docs
- api/openapi.yaml: add /api/v1/version operation under Health tag.
- docs/architecture.md: replace the 'initdb may run the same SQL' paragraph
  with a post-U-3 single-source-of-truth explanation.
- CHANGELOG.md: full unreleased-section entry covering all 5 closures,
  breaking changes, and the new env var.

Audit doc
- coverage-gap-audit-2026-04-24-v5/unified-audit.md: add new P1 #14
  cat-u-seed_initdb_schema_drift; flip the 4 ride-along findings to
  RESOLVED with closure prose pointing at this commit.

Verification: build/vet/test -short -race all clean across all touched
packages locally; govulncheck reports 0 vulnerabilities affecting our
code; OpenAPI YAML parses; CI U-3 grep guardrail clears against the
post-fix tree.
shankar0123
2026-04-25 13:29:23 +00:00
parent aa6fafdee9
commit a3d8b9c607
23 changed files with 1157 additions and 51 deletions
+158
@@ -0,0 +1,158 @@
package handler
import (
"net/http"
"runtime"
"runtime/debug"
)
// VersionHandler exposes the running server's build identity at
// /api/v1/version. U-3 ride-along (cat-u-no_version_endpoint, P2): pre-U-3
// there was no in-band way for an operator (or an automated rollout system)
// to ask "what version of certctl is this binary?" — they had to either read
// the container image tag externally or trust whatever the README said. The
// gap matters for the same operability story U-3 closes: when fresh-clone
// quickstarts fail, the very first question is "what code did I actually
// build", and the only honest answer needs to come from the binary itself.
//
// VersionInfo is populated from three sources, in priority order:
//
// 1. The Version field — typically supplied at build time via
// `-ldflags='-X github.com/shankar0123/certctl/internal/api/handler.Version=v2.0.50'`.
// Production releases set this from the git tag (see release.yml).
//
// 2. runtime/debug.ReadBuildInfo() — populated by Go 1.18+ for any binary
// built from a module. Provides the VCS commit SHA, dirty flag, and
// commit timestamp. We read these fields directly so a `go build` from a
// working tree (no -ldflags incantation) still produces a useful
// /api/v1/version payload — the failure mode pre-U-3 was that everything
// looked like "dev" everywhere, which made "is the bug fixed in this
// binary" unanswerable.
//
// 3. Static fallbacks ("dev" / "unknown") — only reached when neither
// ldflags nor build-info are populated, which in practice means
// `go run` from a non-VCS-tracked workspace.
//
// The handler runs through the no-auth bypass dispatch in cmd/server/main.go
// so probes and rollout systems can query it without presenting Bearer
// credentials, mirroring how /health and /ready are reachable. Audit logging
// excludes /api/v1/version for the same reason — the path is hot under
// rollout polling and would otherwise dominate the audit trail.
type VersionHandler struct{}
// Version is overridden at build time via:
//
// -ldflags='-X github.com/shankar0123/certctl/internal/api/handler.Version=<tag>'
//
// release.yml does this for the server container and CLI/agent binaries.
// The empty default (rather than "dev") lets the Handler fall back to the
// runtime/debug VCS revision when ldflags wasn't supplied — preferable to
// returning a literal "dev" that masks the actual git SHA the binary was
// built from.
var Version = ""
// NewVersionHandler returns a value (not a pointer) to match the
// HealthHandler convention — the handler holds no mutable state and is
// safe to copy.
func NewVersionHandler() VersionHandler {
return VersionHandler{}
}
// VersionInfo is the JSON shape returned by GET /api/v1/version.
//
// Field ordering and tag names are part of the contract — operator tooling
// (k8s rollout checks, CI smoke tests, /api/v1/version Prometheus blackbox
// probes) parses this payload and must continue to work across releases.
// Don't rename a field without an OpenAPI bump and a deprecation cycle.
type VersionInfo struct {
// Version is the human-readable release identifier (e.g. "v2.0.50").
// Falls back to the VCS revision when ldflags wasn't set, and to "dev"
// when the build wasn't VCS-tracked at all.
Version string `json:"version"`
// Commit is the git SHA of HEAD at build time, sourced from
// runtime/debug.BuildInfo.Settings["vcs.revision"]. Empty string when
// the binary was built outside a VCS-tracked workspace (rare —
// `go build` from a tarball does this).
Commit string `json:"commit"`
// Modified reports whether the build had uncommitted changes
// (debug.BuildInfo.Settings["vcs.modified"]). True for developer
// builds, false for release builds out of CI.
Modified bool `json:"modified"`
// BuildTime is the RFC 3339 timestamp reported by
// debug.BuildInfo.Settings["vcs.time"], which is the commit time of
// vcs.revision rather than the wall-clock time of the build. Empty
// when not VCS-tracked.
BuildTime string `json:"build_time"`
// GoVersion is the Go toolchain version that compiled the binary
// (runtime.Version, e.g. "go1.25.9"). Useful when triaging stdlib
// behavior differences ("the deploy that broke was on 1.24, this one
// is on 1.25").
GoVersion string `json:"go_version"`
}
// readBuildInfo extracts the VCS settings from debug.BuildInfo and pairs
// them with the ldflags-supplied Version. Split out from ServeHTTP so
// tests can call it directly (see version_handler_test.go) without going
// through an HTTP round trip. It reads the running binary's own build
// info; there is no seam for injecting a synthetic BuildInfo.
//
// debug.ReadBuildInfo returns ok=false when the binary was built without
// module info — extremely rare for a Go 1.18+ build, but we guard it so
// the handler degrades to "dev / unknown / runtime.Version()" instead of
// nil-deref panicking.
func readBuildInfo() VersionInfo {
info := VersionInfo{
Version: Version,
GoVersion: runtime.Version(),
}
bi, ok := debug.ReadBuildInfo()
if !ok {
// Pre-Go 1.18 binary or a stripped build with no buildinfo segment.
// Both are pathological in 2026 but worth the two-line guard.
if info.Version == "" {
info.Version = "dev"
}
return info
}
for _, s := range bi.Settings {
switch s.Key {
case "vcs.revision":
info.Commit = s.Value
case "vcs.modified":
// debug.BuildInfo encodes this as the literal string "true" or
// "false", so comparing against "true" is enough to decode it.
info.Modified = s.Value == "true"
case "vcs.time":
info.BuildTime = s.Value
}
}
// Fallback ladder for Version: ldflags > VCS commit > "dev". The git
// SHA is more useful than "dev" because it's at least groundable — an
// operator can `git show <sha>` to see what code is actually running.
if info.Version == "" {
if info.Commit != "" {
info.Version = info.Commit
} else {
info.Version = "dev"
}
}
return info
}
// ServeHTTP implements http.Handler. Returns the VersionInfo payload as
// JSON with a 200 status. GET-only — any other method returns 405, matching
// the HealthHandler convention.
func (h VersionHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodGet {
http.Error(w, "Method not allowed", http.StatusMethodNotAllowed)
return
}
JSON(w, http.StatusOK, readBuildInfo())
}
@@ -0,0 +1,108 @@
package handler
import (
"encoding/json"
"net/http"
"net/http/httptest"
"runtime"
"strings"
"testing"
)
// TestVersion_ReturnsBuildInfo is the regression for the U-3 ride-along
// cat-u-no_version_endpoint (P2). Three behaviors must hold for the
// endpoint to be useful in operator tooling:
//
// 1. GET /api/v1/version returns 200 with a JSON body that decodes into
// the documented VersionInfo shape — the wire contract that rollout
// systems and Prometheus blackbox probes parse.
// 2. The Go runtime version always populates (runtime.Version() can never
// return empty), so consumers can always answer "which Go did this
// binary compile with" even when ldflags / VCS info are missing.
// 3. The Version field is never empty — the fallback ladder
// (ldflags > VCS commit > "dev") guarantees a non-empty string so
// consumers don't have to special-case absent values.
//
// We don't pin the exact Version value because it depends on whether the
// test binary was built with -ldflags or under `go test`, both of which
// the handler must tolerate. The "no empty string" check is the
// behavioral contract.
func TestVersion_ReturnsBuildInfo(t *testing.T) {
h := NewVersionHandler()
req := httptest.NewRequest(http.MethodGet, "/api/v1/version", nil)
rec := httptest.NewRecorder()
h.ServeHTTP(rec, req)
if rec.Code != http.StatusOK {
t.Fatalf("status = %d, want 200", rec.Code)
}
contentType := rec.Header().Get("Content-Type")
if !strings.HasPrefix(contentType, "application/json") {
t.Errorf("Content-Type = %q, want application/json prefix (operator tooling parses JSON)", contentType)
}
var got VersionInfo
if err := json.NewDecoder(rec.Body).Decode(&got); err != nil {
t.Fatalf("response body did not decode into VersionInfo: %v\nbody: %s", err, rec.Body.String())
}
// Version must never be empty — the fallback ladder in readBuildInfo
// guarantees this. An empty Version would force every downstream
// consumer (k8s rollouts, Prometheus blackbox, the support tooling)
// to special-case the missing value, which defeats the point of
// /api/v1/version existing.
if got.Version == "" {
t.Error("Version is empty — the fallback ladder (ldflags > VCS commit > 'dev') must guarantee a non-empty value")
}
// GoVersion must equal runtime.Version() — the handler reads it
// directly and cannot be subverted by ldflags or BuildInfo. This is
// the one field that should always be ground-truth.
if got.GoVersion != runtime.Version() {
t.Errorf("GoVersion = %q, want %q (must come straight from runtime.Version())",
got.GoVersion, runtime.Version())
}
}
// TestVersion_RejectsNonGet pins the GET-only contract. /api/v1/version
// is read-only build identity; POST/PUT/DELETE etc. are nonsensical and
// should return 405 like the HealthHandler does. Operator tooling that
// fat-fingers the verb gets a clear error rather than a confusing 200
// from the wrong code path.
func TestVersion_RejectsNonGet(t *testing.T) {
h := NewVersionHandler()
for _, method := range []string{
http.MethodPost, http.MethodPut, http.MethodDelete, http.MethodPatch,
} {
req := httptest.NewRequest(method, "/api/v1/version", nil)
rec := httptest.NewRecorder()
h.ServeHTTP(rec, req)
if rec.Code != http.StatusMethodNotAllowed {
t.Errorf("%s /api/v1/version → status %d, want 405", method, rec.Code)
}
}
}
// TestVersion_LdflagsOverride locks in the priority order: when the
// build-time Version variable is non-empty (e.g. "v2.0.50" injected by
// release.yml), readBuildInfo MUST surface that value verbatim and not
// silently substitute the VCS commit. The release-pipeline contract
// depends on this — a release tagged v2.0.50 should report "v2.0.50",
// not the underlying SHA.
//
// We achieve test isolation by save/restore on the package-level Version
// variable; t.Cleanup restores the original for subsequent tests. Because
// the test mutates a package global, it must not be marked t.Parallel.
func TestVersion_LdflagsOverride(t *testing.T) {
original := Version
t.Cleanup(func() { Version = original })
Version = "v2.0.50-test"
got := readBuildInfo()
if got.Version != "v2.0.50-test" {
t.Errorf("Version = %q, want %q (ldflags-supplied Version must take priority over VCS fallback)",
got.Version, "v2.0.50-test")
}
}
+16
@@ -68,6 +68,11 @@ type HandlerRegistry struct {
HealthChecks *handler.HealthCheckHandler
BulkRevocation handler.BulkRevocationHandler
RenewalPolicies handler.RenewalPolicyHandler
// Version handles GET /api/v1/version (U-3 ride-along,
// cat-u-no_version_endpoint). Wired through the no-auth dispatch in
// cmd/server/main.go so probes and rollout systems can read build
// identity without Bearer credentials. See handler/version.go.
Version handler.VersionHandler
}
// RegisterHandlers sets up all API routes with their handlers.
@@ -89,6 +94,17 @@ func (r *Router) RegisterHandlers(reg HandlerRegistry) {
middleware.CORS,
middleware.ContentType,
))
// Version endpoint (no auth middleware — used by rollout probes that
// don't carry Bearer tokens; the dispatch layer in cmd/server/main.go
// also routes /api/v1/version through the no-auth chain). U-3 ride-along
// (cat-u-no_version_endpoint, P2). The handler reads
// runtime/debug.BuildInfo for VCS attribution; ldflags-supplied Version
// is preferred when present.
r.mux.Handle("GET /api/v1/version", middleware.Chain(
reg.Version,
middleware.CORS,
middleware.ContentType,
))
// Auth check endpoint (uses full middleware chain via r.Register)
r.Register("GET /api/v1/auth/check", http.HandlerFunc(reg.Health.AuthCheck))
+11
@@ -709,6 +709,16 @@ type DatabaseConfig struct {
URL string
MaxConnections int
MigrationsPath string
// DemoSeed, when true, makes the server apply
// `<MigrationsPath>/seed_demo.sql` after the baseline `seed.sql`. Set
// via CERTCTL_DEMO_SEED. The compose demo overlay
// (deploy/docker-compose.demo.yml) sets this to keep the demo path
// alive after U-3 dropped initdb-mounted seed files. The seed file
// uses ON CONFLICT (id) DO NOTHING so re-running on a populated
// database is safe; missing-file is a no-op (returns nil) so a
// minimal-image deploy that strips seed_demo.sql still boots cleanly.
DemoSeed bool
}
// SchedulerConfig contains scheduler timing configuration.
@@ -921,6 +931,7 @@ func Load() (*Config, error) {
URL: getEnv("CERTCTL_DATABASE_URL", "postgres://localhost/certctl"),
MaxConnections: getEnvInt("CERTCTL_DATABASE_MAX_CONNS", 25),
MigrationsPath: getEnv("CERTCTL_DATABASE_MIGRATIONS_PATH", "./migrations"),
DemoSeed: getEnvBool("CERTCTL_DEMO_SEED", false),
},
Scheduler: SchedulerConfig{
RenewalCheckInterval: getEnvDuration("CERTCTL_SCHEDULER_RENEWAL_CHECK_INTERVAL", 1*time.Hour),
+108
@@ -131,3 +131,111 @@ func RunMigrations(db *sql.DB, migrationsPath string) error {
return nil
}
// RunSeed reads and executes the baseline seed SQL file from the migrations
// directory. Designed to run AFTER RunMigrations so every column referenced by
// the seed is already in place.
//
// U-3 (P1, cat-u-seed_initdb_schema_drift): pre-U-3 the deploy compose stack
// mounted both a hand-curated subset of `migrations/*.up.sql` and `seed.sql`
// into postgres `/docker-entrypoint-initdb.d/`. Postgres applied them at
// initdb time. When `seed.sql` was updated to reference columns added by
// migrations *after* the mounted cutoff (e.g., `policy_rules.severity` from
// `000013_policy_rule_severity.up.sql`), initdb crashed during the seed step
// and the container was reported `unhealthy` indefinitely — bare
// `docker compose -f deploy/docker-compose.yml up -d --build` from a fresh
// clone of v2.0.50 hit this on the first try (GitHub #10 reopened by
// mikeakasully). Helm and the example compose files were already runtime-
// only (Path B) and worked through the same window.
//
// Post-U-3 the compose stack drops all initdb mounts; postgres comes up with
// an empty schema; the server applies all migrations via RunMigrations and
// then this function applies the seed. Single source of truth, removes the
// drift hazard architecturally.
//
// The seed file is expected at `<migrationsPath>/seed.sql`. Missing-file is
// treated as a no-op (returns nil) so deployments that explicitly remove the
// seed (custom packaging, cert-manager managed schemas) don't break.
//
// Idempotency: every INSERT in the shipped seed.sql uses
// `ON CONFLICT (id) DO NOTHING`, so re-running on a populated DB is safe.
// This function is invoked on every server start, so the contract MUST hold.
//
// Demo seed: `seed_demo.sql` is applied separately by RunDemoSeed below
// when CERTCTL_DEMO_SEED=true (see internal/config/config.go::DemoSeed).
// Splitting demo from baseline keeps a default deploy from accidentally
// landing 90-days-of-fake-history into a real customer database, while
// still giving the demo overlay a single source of truth (no more initdb
// mounts). The demo seed itself uses ON CONFLICT (id) DO NOTHING so it's
// idempotent; missing-file is also tolerated (custom packaging may strip
// seed_demo.sql to shrink the image).
func RunSeed(db *sql.DB, migrationsPath string) error {
if _, err := os.Stat(migrationsPath); os.IsNotExist(err) {
return fmt.Errorf("migrations directory not found: %s", migrationsPath)
}
seedPath := filepath.Join(migrationsPath, "seed.sql")
content, err := os.ReadFile(seedPath)
if err != nil {
if os.IsNotExist(err) {
// Missing seed.sql is acceptable — operators may have removed it
// for custom-packaging reasons. Return nil rather than fail-loud.
return nil
}
return fmt.Errorf("failed to read seed file %s: %w", seedPath, err)
}
if _, err := db.Exec(string(content)); err != nil {
return fmt.Errorf("failed to execute seed file %s: %w", seedPath, err)
}
return nil
}
// RunDemoSeed applies the demo overlay seed file
// (`<migrationsPath>/seed_demo.sql`) on top of the baseline seed.
//
// U-3 follow-on: pre-U-3 the demo overlay mounted `seed_demo.sql` into
// postgres `/docker-entrypoint-initdb.d/` and relied on initdb to apply it
// alongside the schema. Once U-3 dropped the initdb migration mounts, that
// path stopped working — postgres comes up empty, and the demo seed
// references tables (issuers, certificates, etc.) that wouldn't exist yet
// at initdb time. RunDemoSeed restores the demo capability through the
// same runtime path RunSeed uses, gated by CERTCTL_DEMO_SEED so production
// deploys never accidentally land the fake-history rows.
//
// Order contract: must run AFTER RunSeed so foreign-key references from
// demo rows to baseline rows (e.g., demo certificates referencing
// `rp-default` from baseline) resolve cleanly. The caller in
// cmd/server/main.go enforces this order.
//
// Missing-file is acceptable (returns nil) — operators packaging a
// production-only image often strip seed_demo.sql to shrink the artifact,
// and that should not break boot when CERTCTL_DEMO_SEED happens to be set.
//
// Idempotency: every INSERT in seed_demo.sql uses
// `ON CONFLICT (id) DO NOTHING`, so re-running on a populated DB is safe.
// Server restarts in demo mode therefore re-apply the file harmlessly.
func RunDemoSeed(db *sql.DB, migrationsPath string) error {
if _, err := os.Stat(migrationsPath); os.IsNotExist(err) {
return fmt.Errorf("migrations directory not found: %s", migrationsPath)
}
seedPath := filepath.Join(migrationsPath, "seed_demo.sql")
content, err := os.ReadFile(seedPath)
if err != nil {
if os.IsNotExist(err) {
// Custom production packaging frequently strips this file.
// Fail-soft to preserve the U-3 contract: a missing seed file
// must not gate server boot.
return nil
}
return fmt.Errorf("failed to read demo seed file %s: %w", seedPath, err)
}
if _, err := db.Exec(string(content)); err != nil {
return fmt.Errorf("failed to execute demo seed file %s: %w", seedPath, err)
}
return nil
}
+37 -11
@@ -22,19 +22,37 @@ func NewNotificationRepository(db *sql.DB) *NotificationRepository {
return &NotificationRepository{db: db}
}
-// Create stores a new notification
+// Create stores a new notification.
+//
+// U-3 ride-along (cat-o-notification_created_at_dead_field, P2): the
+// `created_at` column is added to notification_events by migration 000017.
+// Pre-U-3 the Go domain.NotificationEvent had a CreatedAt field but the
+// INSERT path never set it AND no DB column existed — the JSON API
+// serialised the field as `0001-01-01T00:00:00Z`, breaking timestamp
+// ordering on operator dashboards and any consumer that filtered by age.
+// Post-U-3 the column exists with a NOT NULL DEFAULT NOW() backstop, and
+// this INSERT explicitly sets it from the domain field. If the caller
+// hasn't populated CreatedAt (zero-value time.Time) we substitute
+// time.Now() so the row never carries the placeholder zero-time forward
+// — the DEFAULT would handle this too, but emitting the value explicitly
+// keeps the wire-level JSON consistent with what the row will hold once
+// scanNotification reads it back, and prevents a clock-skew gap between
+// "Go computed CreatedAt" and "DB applied DEFAULT NOW()" on the read path.
func (r *NotificationRepository) Create(ctx context.Context, notif *domain.NotificationEvent) error {
if notif.ID == "" {
notif.ID = uuid.New().String()
}
if notif.CreatedAt.IsZero() {
notif.CreatedAt = time.Now()
}
err := r.db.QueryRowContext(ctx, `
INSERT INTO notification_events (
-id, type, certificate_id, channel, recipient, message, sent_at, status, error
-) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
+id, type, certificate_id, channel, recipient, message, sent_at, status, error, created_at
+) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
RETURNING id
`, notif.ID, notif.Type, notif.CertificateID, notif.Channel, notif.Recipient,
-notif.Message, notif.SentAt, notif.Status, notif.Error).Scan(&notif.ID)
+notif.Message, notif.SentAt, notif.Status, notif.Error, notif.CreatedAt).Scan(&notif.ID)
if err != nil {
return fmt.Errorf("failed to create notification: %w", err)
@@ -102,12 +120,14 @@ func (r *NotificationRepository) List(ctx context.Context, filter *repository.No
// Get paginated results. I-005 extends the SELECT with the three retry
// columns (retry_count / next_retry_at / last_error) so scanNotification
-// can populate the new fields on domain.NotificationEvent. The column
-// order here MUST stay in lockstep with scanNotification below.
+// can populate the new fields on domain.NotificationEvent. U-3 extends
+// it once more with `created_at` (column added by migration 000017) so
+// the field is no longer serialized as 0001-01-01. The column order
+// here MUST stay in lockstep with scanNotification below.
offset := (filter.Page - 1) * filter.PerPage
query := fmt.Sprintf(`
SELECT id, type, certificate_id, channel, recipient, message, sent_at, status, error,
-retry_count, next_retry_at, last_error
+retry_count, next_retry_at, last_error, created_at
FROM notification_events
%s
ORDER BY sent_at DESC NULLS LAST
@@ -162,8 +182,14 @@ func (r *NotificationRepository) UpdateStatus(ctx context.Context, id string, st
// scanNotification scans a notification from a row or rows.
//
-// I-005 extends the scan list from 9 → 12 columns (adds retry_count,
-// next_retry_at, last_error). Every caller — List and the four new retry
+// I-005 extended the scan list from 9 → 12 columns (adds retry_count,
+// next_retry_at, last_error). U-3 extends it once more to 13 columns by
+// appending `created_at` (column added by migration 000017,
+// cat-o-notification_created_at_dead_field). CreatedAt scans into a
+// non-pointer time.Time because the migration declares the column
+// NOT NULL with DEFAULT NOW().
+//
+// Every caller — List, ListRetryEligible, and the four other I-005 retry
// methods below — funnels rows through this helper, so the SELECT column
// order in every query must match the Scan order here exactly. RetryCount
// scans into an `int` (migration 000016 declares the column NOT NULL with
@@ -176,7 +202,7 @@ func scanNotification(scanner interface {
var notif domain.NotificationEvent
err := scanner.Scan(&notif.ID, &notif.Type, &notif.CertificateID, &notif.Channel,
&notif.Recipient, &notif.Message, &notif.SentAt, &notif.Status, &notif.Error,
-&notif.RetryCount, &notif.NextRetryAt, &notif.LastError)
+&notif.RetryCount, &notif.NextRetryAt, &notif.LastError, &notif.CreatedAt)
if err != nil {
return nil, fmt.Errorf("failed to scan notification: %w", err)
@@ -248,7 +274,7 @@ func (r *NotificationRepository) ListRetryEligible(ctx context.Context, now time
rows, err := r.db.QueryContext(ctx, `
SELECT id, type, certificate_id, channel, recipient, message, sent_at, status, error,
-retry_count, next_retry_at, last_error
+retry_count, next_retry_at, last_error, created_at
FROM notification_events
WHERE status = 'failed'
AND next_retry_at IS NOT NULL
@@ -339,6 +339,95 @@ func TestNotificationRepository_Requeue(t *testing.T) {
}
}
// TestNotificationRepository_CreatedAt_IsPersisted is the U-3 ride-along
// regression for cat-o-notification_created_at_dead_field. Pre-U-3 the
// Go domain.NotificationEvent had a CreatedAt field but the DB had no
// column — JSON serialisation produced 0001-01-01T00:00:00Z, breaking
// timestamp ordering on operator dashboards. Post-U-3 migration 000017
// adds the column NOT NULL DEFAULT NOW(), Create populates it, and
// scanNotification reads it back.
//
// The contract under test is round-trip equivalence: the timestamp the
// caller sets goes into the DB and comes back out unchanged (modulo
// PostgreSQL's microsecond precision). Truncate to microseconds before
// comparing because TIMESTAMPTZ rounds nanoseconds away.
func TestNotificationRepository_CreatedAt_IsPersisted(t *testing.T) {
tdb := getTestDB(t)
db := tdb.freshSchema(t)
repo := postgres.NewNotificationRepository(db)
ctx := context.Background()
// A specific, recognisable timestamp. Truncated to microseconds so
// the post-roundtrip equality assertion isn't tripped up by Postgres
// dropping the nanosecond tail.
want := time.Now().UTC().Add(-2 * time.Hour).Truncate(time.Microsecond)
notif := &domain.NotificationEvent{
Type: domain.NotificationTypeExpirationWarning,
Channel: domain.NotificationChannelWebhook,
Recipient: "https://hooks.example.com/u3",
Message: "U-3 round-trip witness",
Status: string(domain.NotificationStatusPending),
CreatedAt: want,
}
if err := repo.Create(ctx, notif); err != nil {
t.Fatalf("Create failed: %v", err)
}
// Re-read via List (which goes through scanNotification) so we're
// testing both the INSERT and SELECT halves of the U-3 plumbing.
got, err := repo.List(ctx, nil)
if err != nil {
t.Fatalf("List failed: %v", err)
}
if len(got) != 1 {
t.Fatalf("List returned %d rows, want 1", len(got))
}
if !got[0].CreatedAt.Equal(want) {
t.Errorf("CreatedAt round-trip mismatch:\n set: %v\n got: %v\n"+
"Pre-U-3 this would have come back as 0001-01-01 because the column didn't exist.",
want, got[0].CreatedAt)
}
}
// TestNotificationRepository_CreatedAt_DefaultsToNow verifies the helper
// behavior in Create: when the caller hands over an event with the
// zero-value CreatedAt, Create substitutes time.Now() rather than
// trusting the DB DEFAULT. This keeps wire-level JSON consistent with
// what the row will hold once it's read back, and avoids a clock-skew
// gap between "Go computed the timestamp" and "DB applied DEFAULT NOW()".
func TestNotificationRepository_CreatedAt_DefaultsToNow(t *testing.T) {
tdb := getTestDB(t)
db := tdb.freshSchema(t)
repo := postgres.NewNotificationRepository(db)
ctx := context.Background()
before := time.Now().UTC().Add(-time.Second)
notif := &domain.NotificationEvent{
Type: domain.NotificationTypeExpirationWarning,
Channel: domain.NotificationChannelWebhook,
Recipient: "https://hooks.example.com/zerotime",
Message: "U-3 zero-time fallback",
Status: string(domain.NotificationStatusPending),
// CreatedAt left zero on purpose — the contract is that Create
// fills it in from time.Now() when it's unset.
}
if err := repo.Create(ctx, notif); err != nil {
t.Fatalf("Create failed: %v", err)
}
after := time.Now().UTC().Add(time.Second)
if notif.CreatedAt.IsZero() {
t.Fatalf("CreatedAt is still zero after Create — the fallback in NotificationRepository.Create did not fire")
}
if notif.CreatedAt.Before(before) || notif.CreatedAt.After(after) {
t.Errorf("CreatedAt = %v is outside the [%v, %v] window — the substituted time.Now() should fall inside the test's wall-clock bracket",
notif.CreatedAt, before, after)
}
}
// ─── Helpers ──────────────────────────────────────────────────────────────
// past returns a stable "5 minutes ago" time for fixture seeding. Truncated
@@ -36,7 +36,7 @@ func NewRenewalPolicyRepository(db *sql.DB) *RenewalPolicyRepository {
// and require domain-layer churn we're not taking on in this change.
const renewalPolicyColumns = `
id, name, renewal_window_days, auto_renew, max_retries,
-retry_interval_minutes, alert_thresholds_days, created_at, updated_at
+retry_interval_seconds, alert_thresholds_days, created_at, updated_at
`
// scanRenewalPolicy decodes one renewal_policies row from a Row or Rows
@@ -170,7 +170,7 @@ func (r *RenewalPolicyRepository) Create(ctx context.Context, policy *domain.Ren
insertSQL := `
INSERT INTO renewal_policies (
id, name, renewal_window_days, auto_renew, max_retries,
-retry_interval_minutes, alert_thresholds_days, created_at, updated_at
+retry_interval_seconds, alert_thresholds_days, created_at, updated_at
) VALUES ($1, $2, $3, $4, $5, $6, $7, NOW(), NOW())
RETURNING ` + renewalPolicyColumns
@@ -240,7 +240,7 @@ func (r *RenewalPolicyRepository) Update(ctx context.Context, id string, policy
renewal_window_days = $3,
auto_renew = $4,
max_retries = $5,
-retry_interval_minutes = $6,
+retry_interval_seconds = $6,
alert_thresholds_days = $7,
updated_at = NOW()
WHERE id = $1
@@ -45,7 +45,7 @@ func TestRenewalPolicyRepository_CRUD(t *testing.T) {
RenewalWindowDays: 30,
AutoRenew: true,
MaxRetries: 5,
-RetryInterval: 3600, // stored in retry_interval_minutes column; passthrough
+RetryInterval: 3600, // stored as seconds in retry_interval_seconds column (renamed in 000017_db_coupling_cleanup, U-3)
AlertThresholdsDays: []int{30, 14, 7, 0},
}
@@ -78,7 +78,7 @@ func insertCertPrereqsRaw(t *testing.T, db *sql.DB, ctx context.Context, suffix
}
// Create renewal policy
-_, err = db.ExecContext(ctx, `INSERT INTO renewal_policies (id, name, renewal_window_days, auto_renew, max_retries, retry_interval_minutes, created_at, updated_at) VALUES ($1, $2, $3, $4, $5, $6, $7, $8)`,
+_, err = db.ExecContext(ctx, `INSERT INTO renewal_policies (id, name, renewal_window_days, auto_renew, max_retries, retry_interval_seconds, created_at, updated_at) VALUES ($1, $2, $3, $4, $5, $6, $7, $8)`,
policyID, "Policy "+suffix, 30, true, 3, 60, now, now)
if err != nil {
t.Fatalf("insertCertPrereqs: create renewal_policy failed: %v", err)
@@ -0,0 +1,246 @@
// Integration tests for the U-3 schema-vs-seed coupling fix.
//
// Pre-U-3 the deploy compose stack mounted both a hand-curated subset of
// `migrations/*.up.sql` and `seed.sql` into postgres
// `/docker-entrypoint-initdb.d/`. Postgres applied them at initdb time.
// When `seed.sql` was updated to reference columns added by migrations
// *after* the mounted cutoff (e.g., `policy_rules.severity` from
// `000013_policy_rule_severity.up.sql`), initdb crashed during the seed
// step and the container was reported `unhealthy` indefinitely.
//
// Post-U-3 the schema is built EXCLUSIVELY by the server at startup via
// internal/repository/postgres.RunMigrations + RunSeed. These tests pin
// that contract: RunSeed must complete without error against a freshly
// migrated database, and re-application must be idempotent so server
// restarts don't double-insert.
//
// Skipped under -short to keep CI fast lanes green; the integration lane
// runs them via the testcontainers harness.
package postgres_test
import (
"context"
"database/sql"
"testing"
"github.com/shankar0123/certctl/internal/repository/postgres"
)
// TestRunSeed_AppliesIdempotently verifies the U-3 contract that RunSeed
// can be called repeatedly against a populated database without error and
// without producing duplicate rows. The server invokes RunSeed on EVERY
// boot (it has no migration-state table to skip from), so any non-
// idempotent INSERT in seed.sql would crash the container loop on the
// second start.
//
// The assertion uses renewal_policies.id='rp-default' as a witness — that
// row is the most-referenced FK target in the seed (it's the default
// renewal policy attached to every certificate that doesn't override).
// If the seed double-inserted, we'd see SQLSTATE 23505 from the second
// RunSeed call. If the seed silently ON CONFLICT-DO-NOTHING'd as
// designed, the row count stays at exactly 1.
func TestRunSeed_AppliesIdempotently(t *testing.T) {
tdb := getTestDB(t)
db := tdb.freshSchema(t)
ctx := context.Background()
migrationsPath := findMigrationsDir()
// Apply the seed twice; the second call simulates a server restart on
// a populated database. Both must succeed; the second call would fail
// with SQLSTATE 23505 if any INSERT in seed.sql lacked ON CONFLICT.
if err := postgres.RunSeed(db, migrationsPath); err != nil {
t.Fatalf("RunSeed (first call) returned error: %v", err)
}
if err := postgres.RunSeed(db, migrationsPath); err != nil {
t.Fatalf("RunSeed (second call — idempotency check) returned error: %v\n"+
"This usually means an INSERT attempted a duplicate row; every INSERT in seed.sql "+
"must use ON CONFLICT (id) DO NOTHING because the server applies the "+
"seed on EVERY start.", err)
}
// Witness check: rp-default is the renewal policy every cert defaults
// to. Exactly one row must exist after two seed applications.
var count int
err := db.QueryRowContext(ctx,
`SELECT COUNT(*) FROM renewal_policies WHERE id = 'rp-default'`,
).Scan(&count)
if err != nil {
t.Fatalf("witness query failed: %v", err)
}
if count != 1 {
t.Errorf("renewal_policies WHERE id='rp-default' returned %d rows after two RunSeed calls; want exactly 1 (ON CONFLICT idempotency contract)", count)
}
}
// TestRunSeed_MissingFileIsNoOp verifies the fail-soft contract documented
// on RunSeed: an operator who deletes seed.sql for custom packaging (CI
// pipelines that bake their own seeds, cert-manager managed deployments)
// must still get a healthy server boot. RunSeed returning nil for a
// missing file is the only way to hold this contract — returning an error
// would force every minimal-image deployment to ship the seed file just
// to satisfy a no-op load.
//
// We point at a directory that exists (empty temp dir) but contains no
// seed.sql. RunSeed must return nil silently.
func TestRunSeed_MissingFileIsNoOp(t *testing.T) {
if testing.Short() {
t.Skip("skipping integration test in short mode")
}
// Use a brand-new empty directory so seed.sql is unambiguously absent.
emptyDir := t.TempDir()
// Pass a nil *sql.DB on purpose: RunSeed must short-circuit on the
// missing file BEFORE touching the DB. If the implementation ever
// regresses and calls db.Exec after the failed read, the nil handle
// will surface it as a nil-pointer panic instead of a silent pass.
var db *sql.DB
if err := postgres.RunSeed(db, emptyDir); err != nil {
t.Fatalf("RunSeed against an empty directory should return nil; got: %v", err)
}
}
// TestRunDemoSeed_AppliesIdempotently mirrors the RunSeed idempotency
// contract for the demo overlay. The compose demo stack
// (deploy/docker-compose.demo.yml) sets CERTCTL_DEMO_SEED=true; the
// server applies seed_demo.sql at every boot. Same constraint as the
// baseline seed: if any INSERT lacks ON CONFLICT, the server will
// crash-loop on restart.
//
// Witness: seed_demo.sql inserts t-platform into the teams table at line
// 11. That row is referenced by every demo-team-owned certificate, so
// duplicate-insertion would block the entire demo on restart.
func TestRunDemoSeed_AppliesIdempotently(t *testing.T) {
tdb := getTestDB(t)
db := tdb.freshSchema(t)
ctx := context.Background()
migrationsPath := findMigrationsDir()
// Order matters — RunSeed must run first so the FK targets the demo
// seed depends on (rp-* renewal policies, etc.) exist before the
// demo INSERTs run. This mirrors the order in cmd/server/main.go.
if err := postgres.RunSeed(db, migrationsPath); err != nil {
t.Fatalf("RunSeed prerequisite failed: %v", err)
}
if err := postgres.RunDemoSeed(db, migrationsPath); err != nil {
t.Fatalf("RunDemoSeed (first call) returned error: %v", err)
}
if err := postgres.RunDemoSeed(db, migrationsPath); err != nil {
t.Fatalf("RunDemoSeed (second call — idempotency check) returned error: %v", err)
}
var count int
err := db.QueryRowContext(ctx,
`SELECT COUNT(*) FROM teams WHERE id = 't-platform'`,
).Scan(&count)
if err != nil {
t.Fatalf("witness query failed: %v", err)
}
if count != 1 {
t.Errorf("teams WHERE id='t-platform' returned %d rows after two RunDemoSeed calls; want exactly 1", count)
}
}
// TestMigration000017_RetryIntervalRename verifies the U-3 ride-along
// column rename: renewal_policies.retry_interval_minutes →
// retry_interval_seconds (cat-o-retry_interval_unit_mismatch). The unit
// was always seconds in practice — the column name lied. Migration 000017
// renames the column with a DO $$ guard so re-application is safe.
//
// After all migrations have been applied (which the test harness does in
// freshSchema), the new column must exist and the old column must NOT.
// information_schema.columns is the source of truth for both checks.
func TestMigration000017_RetryIntervalRename(t *testing.T) {
tdb := getTestDB(t)
db := tdb.freshSchema(t)
ctx := context.Background()
// Helper — true iff the named column exists on renewal_policies.
hasColumn := func(name string) bool {
t.Helper()
var n int
err := db.QueryRowContext(ctx, `
SELECT COUNT(*) FROM information_schema.columns
WHERE table_name = 'renewal_policies' AND column_name = $1
`, name).Scan(&n)
if err != nil {
t.Fatalf("information_schema query for column %q failed: %v", name, err)
}
return n > 0
}
if !hasColumn("retry_interval_seconds") {
t.Error("renewal_policies.retry_interval_seconds is missing: migration 000017 did not apply, or its rename block did not take effect")
}
if hasColumn("retry_interval_minutes") {
t.Error("renewal_policies.retry_interval_minutes still exists — the rename in migration 000017 must drop the old name (cat-o-retry_interval_unit_mismatch)")
}
}
// TestMigration000017_NotificationCreatedAt verifies the U-3 ride-along
// column add: notification_events.created_at NOT NULL DEFAULT NOW()
// (cat-o-notification_created_at_dead_field). Pre-U-3 the Go domain had
// the field but the DB lacked the column, so the JSON API serialised
// 0001-01-01.
func TestMigration000017_NotificationCreatedAt(t *testing.T) {
tdb := getTestDB(t)
db := tdb.freshSchema(t)
ctx := context.Background()
var dataType, isNullable, columnDefault sql.NullString
err := db.QueryRowContext(ctx, `
SELECT data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'notification_events' AND column_name = 'created_at'
`).Scan(&dataType, &isNullable, &columnDefault)
if err != nil {
t.Fatalf("information_schema query for created_at failed: %v\n"+
"Migration 000017 should have added notification_events.created_at TIMESTAMPTZ NOT NULL DEFAULT NOW().", err)
}
if dataType.String != "timestamp with time zone" {
t.Errorf("notification_events.created_at data_type = %q, want %q",
dataType.String, "timestamp with time zone")
}
if isNullable.String != "NO" {
t.Errorf("notification_events.created_at is_nullable = %q, want NO (the column must be NOT NULL so every row, including legacy rows backfilled by the DEFAULT, carries a timestamp)",
isNullable.String)
}
if columnDefault.String == "" {
t.Error("notification_events.created_at has no DEFAULT; rows that existed before migration 000017 would violate the NOT NULL constraint without one")
}
}
// TestMigration000017_HealthCheckOrphansDropped verifies the U-3
// ride-along column drop: network_scan_targets lost the orphan
// health_check_enabled / health_check_interval_seconds columns
// (cat-o-health_check_column_orphans). These were declared by an early
// migration but never wired into Go code — schema noise that confused
// operators reading raw SQL. Migration 000017 drops them.
func TestMigration000017_HealthCheckOrphansDropped(t *testing.T) {
tdb := getTestDB(t)
db := tdb.freshSchema(t)
ctx := context.Background()
hasColumn := func(name string) bool {
t.Helper()
var n int
err := db.QueryRowContext(ctx, `
SELECT COUNT(*) FROM information_schema.columns
WHERE table_name = 'network_scan_targets' AND column_name = $1
`, name).Scan(&n)
if err != nil {
t.Fatalf("information_schema query for column %q failed: %v", name, err)
}
return n > 0
}
for _, col := range []string{"health_check_enabled", "health_check_interval_seconds"} {
if hasColumn(col) {
t.Errorf("network_scan_targets.%s still exists — migration 000017 must drop it (cat-o-health_check_column_orphans)", col)
}
}
}