Files
certctl/internal/repository/postgres/renewal_policy.go
T
shankar0123 a3d8b9c607 fix(deploy,db,handler): close fresh-clone postgres init failure + 4 ride-along audit findings (U-3 master)
GitHub #10 reopened: operator mikeakasully cloned v2.0.50 fresh and ran the
canonical quickstart (docker compose -f deploy/docker-compose.yml up -d --build);
postgres reported unhealthy indefinitely, dependent containers never started.

Root cause: deploy/docker-compose.yml mounted a hand-curated subset of
migrations/*.up.sql + seed.sql into postgres /docker-entrypoint-initdb.d/.
Postgres applied them at initdb time. Once seed.sql referenced columns added
by migrations *after* the mounted cutoff (e.g., policy_rules.severity from
migration 000013), initdb crashed mid-seed and the container loop wedged.
Two sources of truth (compose mount list vs in-tree migration ladder)
diverged the moment a seed-touching migration shipped, and the only thing
that fixed it was hand-editing the compose file every release.

Fix: remove the dual source. Postgres boots empty; the server applies
migrations + seed at startup via RunMigrations + RunSeed. Helm has used
this pattern since day one (postgres-init emptyDir); compose now matches.

Bundled with four ride-along audit findings whose fixes share the same
schema/db code surface, so operators take the schema-change pain only once:

  cat-u-seed_initdb_schema_drift           [P1, primary] — initdb-mount fix
  cat-o-retry_interval_unit_mismatch       [P1] — column rename minutes→seconds
  cat-o-notification_created_at_dead_field [P2] — add column + populate
  cat-o-health_check_column_orphans        [P1] — drop unwired columns
  cat-u-no_version_endpoint                [P2] — add /api/v1/version

Single migration (000017_db_coupling_cleanup) bundles the three schema
changes under a DO \$\$ guard so re-application is safe; reduces
operator-visible 'schema-change releases' from four to one.

Backend
- internal/repository/postgres/db.go: add RunSeed (baseline) + RunDemoSeed
  (gated by CERTCTL_DEMO_SEED). Both idempotent (ON CONFLICT DO NOTHING in
  every shipped INSERT) so repeated boots are safe; missing-file is no-op
  so custom packaging that strips seeds still boots cleanly.
- cmd/server/main.go: invoke RunSeed (always) + RunDemoSeed (when flag set)
  immediately after RunMigrations.
- internal/repository/postgres/notification.go: NotificationRepository.Create
  now sets created_at (with time.Now() fallback when caller leaves it zero);
  scanNotification reads it back; List + ListRetryEligible SELECT extended.
- internal/repository/postgres/renewal_policy.go: column references updated
  to retry_interval_seconds across SELECT/INSERT/UPDATE sites.
- internal/api/handler/version.go: new VersionHandler exposes
  {version, commit, modified, build_time, go_version} from
  runtime/debug.ReadBuildInfo() with ldflags-supplied Version override.
- internal/api/router/router.go: register GET /api/v1/version through the
  no-auth chain (CORS + ContentType) alongside /health, /ready,
  /api/v1/auth/info.
- cmd/server/main.go: add /api/v1/version to no-auth dispatch + audit
  ExcludePaths so rollout polling doesn't dominate the audit trail.
- internal/config/config.go: add DatabaseConfig.DemoSeed +
  CERTCTL_DEMO_SEED env var.

Migration
- migrations/000017_db_coupling_cleanup.up.sql + .down.sql:
    (1) renewal_policies.retry_interval_minutes → retry_interval_seconds
        (DO \$\$ guard, idempotent re-application)
    (2) notification_events ADD COLUMN created_at TIMESTAMPTZ
        NOT NULL DEFAULT NOW()
    (3) network_scan_targets DROP orphan health_check_enabled +
        health_check_interval_seconds
- migrations/seed.sql: column reference updated to retry_interval_seconds.
- migrations/seed_demo.sql: same column rename + applied at runtime now via
  RunDemoSeed (no longer initdb-mounted).

Compose
- deploy/docker-compose.yml: drop ALL initdb mounts (10 migration files +
  seed.sql); add start_period: 30s to postgres + certctl-server healthchecks
  to absorb the runtime migration + seed application window on first boot.
- deploy/docker-compose.test.yml: same drop (+ ghost seed_test.sql mount
  removed; that file never existed); same healthcheck start_period.
- deploy/docker-compose.demo.yml: replace seed_demo.sql initdb mount with
  CERTCTL_DEMO_SEED=true env var on certctl-server.

Tests
- internal/api/handler/version_handler_test.go: TestVersion_ReturnsBuildInfo,
  TestVersion_RejectsNonGet, TestVersion_LdflagsOverride.
- internal/repository/postgres/seed_test.go: TestRunSeed_AppliesIdempotently,
  TestRunSeed_MissingFileIsNoOp, TestRunDemoSeed_AppliesIdempotently,
  TestMigration000017_RetryIntervalRename,
  TestMigration000017_NotificationCreatedAt,
  TestMigration000017_HealthCheckOrphansDropped (testcontainers, -short skips).
- internal/repository/postgres/notification_test.go:
  TestNotificationRepository_CreatedAt_IsPersisted +
  TestNotificationRepository_CreatedAt_DefaultsToNow.

CI guardrail
- .github/workflows/ci.yml: new 'Forbidden migration mount in compose initdb
  (U-3)' step grep-fails the build if any migrations/*.sql or seed*.sql
  re-appears in /docker-entrypoint-initdb.d in any compose file. Catches
  future drift before a fresh-clone operator hits it.

Spec / Docs
- api/openapi.yaml: add /api/v1/version operation under Health tag.
- docs/architecture.md: replace the 'initdb may run the same SQL' paragraph
  with a post-U-3 single-source-of-truth explanation.
- CHANGELOG.md: full unreleased-section entry covering all 5 closures,
  breaking changes, and the new env var.

Audit doc
- coverage-gap-audit-2026-04-24-v5/unified-audit.md: add new P1 #14
  cat-u-seed_initdb_schema_drift; flip the 4 ride-along findings to
   RESOLVED with closure prose pointing at this commit.

Verification: build/vet/test -short -race all clean across all touched
packages locally; govulncheck reports 0 vulnerabilities affecting our
code; OpenAPI YAML parses; CI U-3 grep guardrail clears against the
post-fix tree.
2026-04-25 13:29:23 +00:00

290 lines
9.9 KiB
Go

package postgres
import (
"context"
"database/sql"
"encoding/json"
"errors"
"fmt"
"regexp"
"strings"
"github.com/lib/pq"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/repository"
)
// RenewalPolicyRepository implements repository.RenewalPolicyRepository.
type RenewalPolicyRepository struct {
db *sql.DB
}
// NewRenewalPolicyRepository creates a new RenewalPolicyRepository.
func NewRenewalPolicyRepository(db *sql.DB) *RenewalPolicyRepository {
return &RenewalPolicyRepository{db: db}
}
// SELECT column order is the shared contract between scanRenewalPolicy and
// every SELECT/RETURNING in this file. Keep them in lockstep; if you add a
// new column, add it to all SELECTs, all scan calls, and scanRenewalPolicy.
//
// Note: certificate_profile_id and agent_group_id live on renewal_policies
// (migrations 000003 and 000004) but are deliberately NOT read here — that
// pre-existing drift is out of G-1's minimum-viable-delta and is tracked in
// the design doc §8. Introducing them would change struct shapes / JSON tags
// and require domain-layer churn we're not taking on in this change.
const renewalPolicyColumns = `
id, name, renewal_window_days, auto_renew, max_retries,
retry_interval_seconds, alert_thresholds_days, created_at, updated_at
`
// scanRenewalPolicy decodes one renewal_policies row from a Row or Rows
// scanner, unmarshaling alert_thresholds_days JSONB into the domain slice.
// Malformed JSONB silently falls back to DefaultAlertThresholds() — same
// behavior as the pre-G-1 code so we don't change observable semantics.
func scanRenewalPolicy(scanner interface {
Scan(dest ...any) error
}) (*domain.RenewalPolicy, error) {
var policy domain.RenewalPolicy
var thresholdsJSON []byte
if err := scanner.Scan(
&policy.ID, &policy.Name, &policy.RenewalWindowDays, &policy.AutoRenew,
&policy.MaxRetries, &policy.RetryInterval, &thresholdsJSON,
&policy.CreatedAt, &policy.UpdatedAt,
); err != nil {
return nil, err
}
if len(thresholdsJSON) > 0 {
if err := json.Unmarshal(thresholdsJSON, &policy.AlertThresholdsDays); err != nil {
policy.AlertThresholdsDays = domain.DefaultAlertThresholds()
}
}
return &policy, nil
}
// Get retrieves a renewal policy by ID.
func (r *RenewalPolicyRepository) Get(ctx context.Context, id string) (*domain.RenewalPolicy, error) {
row := r.db.QueryRowContext(ctx, `SELECT `+renewalPolicyColumns+` FROM renewal_policies WHERE id = $1`, id)
policy, err := scanRenewalPolicy(row)
if err != nil {
if errors.Is(err, sql.ErrNoRows) {
return nil, fmt.Errorf("renewal policy not found: %s", id)
}
return nil, fmt.Errorf("failed to query renewal policy: %w", err)
}
return policy, nil
}
// List returns all renewal policies, ordered by name (matches the index on
// renewal_policies.name from migration 000001 so ORDER BY is index-served).
func (r *RenewalPolicyRepository) List(ctx context.Context) ([]*domain.RenewalPolicy, error) {
rows, err := r.db.QueryContext(ctx, `SELECT `+renewalPolicyColumns+` FROM renewal_policies ORDER BY name`)
if err != nil {
return nil, fmt.Errorf("failed to query renewal policies: %w", err)
}
defer rows.Close()
var policies []*domain.RenewalPolicy
for rows.Next() {
policy, err := scanRenewalPolicy(rows)
if err != nil {
return nil, fmt.Errorf("failed to scan renewal policy: %w", err)
}
policies = append(policies, policy)
}
if err := rows.Err(); err != nil {
return nil, fmt.Errorf("error iterating renewal policy rows: %w", err)
}
return policies, nil
}
// slugRegex matches non-alphanumeric characters that slugifyPolicyName strips.
var slugRegex = regexp.MustCompile(`[^a-z0-9-]+`)
// slugifyPolicyName produces `rp-<slug>` for an auto-generated policy ID.
// Slug: lowercase, spaces→hyphens, non-alphanumeric stripped, trimmed to 64
// chars. Matches the existing seed convention (rp-default, rp-standard,
// rp-urgent). Collision resolution is handled by Create's retry loop.
func slugifyPolicyName(name string) string {
slug := strings.ToLower(strings.TrimSpace(name))
slug = strings.ReplaceAll(slug, " ", "-")
slug = slugRegex.ReplaceAllString(slug, "")
slug = strings.Trim(slug, "-")
if slug == "" {
slug = "policy"
}
if len(slug) > 64 {
slug = slug[:64]
}
return "rp-" + slug
}
// isUniqueViolation reports whether err is a PostgreSQL 23505 unique_violation.
// Used by Create/Update to translate name-collision errors onto the typed
// ErrRenewalPolicyDuplicateName sentinel.
func isUniqueViolation(err error) bool {
var pqErr *pq.Error
return errors.As(err, &pqErr) && pqErr.Code == "23505"
}
// isForeignKeyViolation reports whether err is a PostgreSQL 23503
// foreign_key_violation. Used by Delete to translate ON DELETE RESTRICT
// failures onto the typed ErrRenewalPolicyInUse sentinel.
func isForeignKeyViolation(err error) bool {
var pqErr *pq.Error
return errors.As(err, &pqErr) && pqErr.Code == "23503"
}
// Create inserts a new renewal policy. If policy.ID is empty, auto-generates
// `rp-<slug(name)>` with -2/-3/... suffixes on collision (up to 10 attempts).
// Returns ErrRenewalPolicyDuplicateName on pg 23505 (name collision).
//
// alert_thresholds_days is marshaled to JSONB here rather than relying on the
// DB default because the service layer already applies DefaultAlertThresholds
// for empty input — the DB default is a safety net, not the primary path.
func (r *RenewalPolicyRepository) Create(ctx context.Context, policy *domain.RenewalPolicy) error {
if policy == nil {
return errors.New("renewal policy is nil")
}
thresholdsJSON, err := json.Marshal(policy.AlertThresholdsDays)
if err != nil {
return fmt.Errorf("failed to marshal alert thresholds: %w", err)
}
// ID auto-generation with collision retry. We attempt up to 10 suffix
// variants (rp-foo, rp-foo-2, ..., rp-foo-10) before giving up — the
// 23505 error the caller gets back past that point is on Name (their
// job to fix) rather than on a slug-collision we swallowed.
baseID := policy.ID
if baseID == "" {
baseID = slugifyPolicyName(policy.Name)
}
insertSQL := `
INSERT INTO renewal_policies (
id, name, renewal_window_days, auto_renew, max_retries,
retry_interval_seconds, alert_thresholds_days, created_at, updated_at
) VALUES ($1, $2, $3, $4, $5, $6, $7, NOW(), NOW())
RETURNING ` + renewalPolicyColumns
maxAttempts := 10
if policy.ID != "" {
// Caller supplied a specific ID — no collision-retry, just one shot.
maxAttempts = 1
}
for attempt := 1; attempt <= maxAttempts; attempt++ {
candidateID := baseID
if attempt > 1 {
candidateID = fmt.Sprintf("%s-%d", baseID, attempt)
}
row := r.db.QueryRowContext(ctx, insertSQL,
candidateID, policy.Name, policy.RenewalWindowDays, policy.AutoRenew,
policy.MaxRetries, policy.RetryInterval, thresholdsJSON,
)
inserted, scanErr := scanRenewalPolicy(row)
if scanErr == nil {
*policy = *inserted
return nil
}
if isUniqueViolation(scanErr) {
// Determine which unique constraint — if it's the name UNIQUE
// we can't recover (caller has to pick a new name); if it's the
// primary-key slug collision we loop to the next suffix.
var pqErr *pq.Error
errors.As(scanErr, &pqErr)
// Postgres reports the constraint name in pqErr.Constraint;
// renewal_policies_name_key is the name UNIQUE, renewal_policies_pkey
// is the PK. Name collision is terminal, PK collision is retryable.
if pqErr.Constraint != "" && !strings.Contains(pqErr.Constraint, "pkey") {
return repository.ErrRenewalPolicyDuplicateName
}
// PK collision — try next suffix.
continue
}
return fmt.Errorf("failed to insert renewal policy: %w", scanErr)
}
// Exhausted retry budget on PK collisions — surface as duplicate so the
// caller at least gets a 409 rather than a mysterious 500.
return repository.ErrRenewalPolicyDuplicateName
}
// Update modifies an existing renewal policy by ID. Returns an error wrapping
// sql.ErrNoRows when id is unknown (detected by RETURNING returning zero rows),
// or ErrRenewalPolicyDuplicateName on pg 23505 (name collision with another row).
func (r *RenewalPolicyRepository) Update(ctx context.Context, id string, policy *domain.RenewalPolicy) error {
if policy == nil {
return errors.New("renewal policy is nil")
}
thresholdsJSON, err := json.Marshal(policy.AlertThresholdsDays)
if err != nil {
return fmt.Errorf("failed to marshal alert thresholds: %w", err)
}
row := r.db.QueryRowContext(ctx, `
UPDATE renewal_policies SET
name = $2,
renewal_window_days = $3,
auto_renew = $4,
max_retries = $5,
retry_interval_seconds = $6,
alert_thresholds_days = $7,
updated_at = NOW()
WHERE id = $1
RETURNING `+renewalPolicyColumns,
id, policy.Name, policy.RenewalWindowDays, policy.AutoRenew,
policy.MaxRetries, policy.RetryInterval, thresholdsJSON,
)
updated, err := scanRenewalPolicy(row)
if err != nil {
if errors.Is(err, sql.ErrNoRows) {
return fmt.Errorf("renewal policy not found: %s", id)
}
if isUniqueViolation(err) {
return repository.ErrRenewalPolicyDuplicateName
}
return fmt.Errorf("failed to update renewal policy: %w", err)
}
*policy = *updated
return nil
}
// Delete removes a renewal policy by ID. Returns ErrRenewalPolicyInUse when
// the policy is still referenced by rows in managed_certificates (pg 23503
// foreign_key_violation against the ON DELETE RESTRICT FK from
// managed_certificates.renewal_policy_id). Returns an error wrapping
// sql.ErrNoRows when id is unknown.
func (r *RenewalPolicyRepository) Delete(ctx context.Context, id string) error {
result, err := r.db.ExecContext(ctx, `DELETE FROM renewal_policies WHERE id = $1`, id)
if err != nil {
if isForeignKeyViolation(err) {
return repository.ErrRenewalPolicyInUse
}
return fmt.Errorf("failed to delete renewal policy: %w", err)
}
rows, err := result.RowsAffected()
if err != nil {
return fmt.Errorf("failed to read RowsAffected for delete: %w", err)
}
if rows == 0 {
return fmt.Errorf("renewal policy not found: %s", id)
}
return nil
}