certctl

mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 20:51:30 +00:00

Author	SHA1	Message	Date
shankar0123	a3d8b9c607	fix(deploy,db,handler): close fresh-clone postgres init failure + 4 ride-along audit findings (U-3 master)

GitHub #10 reopened: operator mikeakasully cloned v2.0.50 fresh and ran the canonical quickstart (docker compose -f deploy/docker-compose.yml up -d --build); postgres reported unhealthy indefinitely, dependent containers never started.

Root cause: deploy/docker-compose.yml mounted a hand-curated subset of migrations/*.up.sql + seed.sql into postgres /docker-entrypoint-initdb.d/. Postgres applied them at initdb time. Once seed.sql referenced columns added by migrations *after* the mounted cutoff (e.g., policy_rules.severity from migration 000013), initdb crashed mid-seed and the container loop wedged. Two sources of truth (compose mount list vs in-tree migration ladder) diverged the moment a seed-touching migration shipped, and the only thing that fixed it was hand-editing the compose file every release.

Fix: remove the dual source. Postgres boots empty; the server applies migrations + seed at startup via RunMigrations + RunSeed. Helm has used this pattern since day one (postgres-init emptyDir); compose now matches.

Bundled with four ride-along audit findings whose fixes share the same schema/db code surface, so operators take the schema-change pain only once:

  cat-u-seed_initdb_schema_drift [P1, primary] — initdb-mount fix
  cat-o-retry_interval_unit_mismatch [P1] — column rename minutes→seconds
  cat-o-notification_created_at_dead_field [P2] — add column + populate
  cat-o-health_check_column_orphans [P1] — drop unwired columns
  cat-u-no_version_endpoint [P2] — add /api/v1/version

Single migration (000017_db_coupling_cleanup) bundles the three schema changes under a DO $$ guard so re-application is safe; reduces operator-visible 'schema-change releases' from four to one.

Backend
- internal/repository/postgres/db.go: add RunSeed (baseline) + RunDemoSeed (gated by CERTCTL_DEMO_SEED). Both idempotent (ON CONFLICT DO NOTHING in every shipped INSERT) so repeated boots are safe; missing-file is no-op so custom packaging that strips seeds still boots cleanly.
- cmd/server/main.go: invoke RunSeed (always) + RunDemoSeed (when flag set) immediately after RunMigrations.
- internal/repository/postgres/notification.go: NotificationRepository.Create now sets created_at (with time.Now() fallback when caller leaves it zero); scanNotification reads it back; List + ListRetryEligible SELECT extended.
- internal/repository/postgres/renewal_policy.go: column references updated to retry_interval_seconds across SELECT/INSERT/UPDATE sites.
- internal/api/handler/version.go: new VersionHandler exposes {version, commit, modified, build_time, go_version} from runtime/debug.ReadBuildInfo() with ldflags-supplied Version override.
- internal/api/router/router.go: register GET /api/v1/version through the no-auth chain (CORS + ContentType) alongside /health, /ready, /api/v1/auth/info.
- cmd/server/main.go: add /api/v1/version to no-auth dispatch + audit ExcludePaths so rollout polling doesn't dominate the audit trail.
- internal/config/config.go: add DatabaseConfig.DemoSeed + CERTCTL_DEMO_SEED env var.

Migration
- migrations/000017_db_coupling_cleanup.up.sql + .down.sql:
  (1) renewal_policies.retry_interval_minutes → retry_interval_seconds (DO $$ guard, idempotent re-application)
  (2) notification_events ADD COLUMN created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
  (3) network_scan_targets DROP orphan health_check_enabled + health_check_interval_seconds
- migrations/seed.sql: column reference updated to retry_interval_seconds.
- migrations/seed_demo.sql: same column rename + applied at runtime now via RunDemoSeed (no longer initdb-mounted).

Compose
- deploy/docker-compose.yml: drop ALL initdb mounts (10 migration files + seed.sql); add start_period: 30s to postgres + certctl-server healthchecks to absorb the runtime migration + seed application window on first boot.
- deploy/docker-compose.test.yml: same drop (+ ghost seed_test.sql mount removed; that file never existed); same healthcheck start_period.
- deploy/docker-compose.demo.yml: replace seed_demo.sql initdb mount with CERTCTL_DEMO_SEED=true env var on certctl-server.

Tests
- internal/api/handler/version_handler_test.go: TestVersion_ReturnsBuildInfo, TestVersion_RejectsNonGet, TestVersion_LdflagsOverride.
- internal/repository/postgres/seed_test.go: TestRunSeed_AppliesIdempotently, TestRunSeed_MissingFileIsNoOp, TestRunDemoSeed_AppliesIdempotently, TestMigration000017_RetryIntervalRename, TestMigration000017_NotificationCreatedAt, TestMigration000017_HealthCheckOrphansDropped (testcontainers, -short skips).
- internal/repository/postgres/notification_test.go: TestNotificationRepository_CreatedAt_IsPersisted + TestNotificationRepository_CreatedAt_DefaultsToNow.

CI guardrail
- .github/workflows/ci.yml: new 'Forbidden migration mount in compose initdb (U-3)' step grep-fails the build if any migrations/*.sql or seed.sql re-appears in /docker-entrypoint-initdb.d in any compose file. Catches future drift before a fresh-clone operator hits it.

Spec / Docs
- api/openapi.yaml: add /api/v1/version operation under Health tag.
- docs/architecture.md: replace the 'initdb may run the same SQL' paragraph with a post-U-3 single-source-of-truth explanation.
- CHANGELOG.md: full unreleased-section entry covering all 5 closures, breaking changes, and the new env var.

Audit doc
- coverage-gap-audit-2026-04-24-v5/unified-audit.md: add new P1 #14 cat-u-seed_initdb_schema_drift; flip the 4 ride-along findings to ✅ RESOLVED with closure prose pointing at this commit.

Verification: build/vet/test -short -race all clean across all touched packages locally; govulncheck reports 0 vulnerabilities affecting our code; OpenAPI YAML parses; CI U-3 grep guardrail clears against the post-fix tree.	2026-04-25 13:29:23 +00:00
shankar0123	a91197014f	fix(db): emit volume-state guidance on postgres auth failure (U-1, #10)

The shipped quickstart instructs operators to copy deploy/.env.example to deploy/.env, edit POSTGRES_PASSWORD, and run docker compose up. On the first boot of a fresh checkout this works. On the second boot — i.e., when an operator first booted with the default POSTGRES_PASSWORD=certctl, then edited .env and re-ran up — the certctl-server container picks up the new password (env interpolated at every container start) but postgres does not. The postgres docker-entrypoint runs initdb only when the data dir is empty; on subsequent boots the persistent named volume postgres_data is non-empty so pg_authid retains the password baked in on first boot. The server connects with the new credentials, postgres rejects them, and the operator sees an opaque `pq: password authentication failed for user "certctl"` in the server log with no pointer to the actual cause. New-operator onboarding gets blocked on the documented production path.

Why a doc fix alone is not sufficient. Operators don't reread the docs after a successful first boot — the trap fires on the second up, when they think they've already learned the system. The opaque pq error is indistinguishable in the log from a typo'd password or a misconfigured secret store. The diagnostic has to fire at the moment the failure is observed.

Why we don't try to fix the bootstrap. The env-vs-pg_authid divergence is intrinsic to how the official postgres image bootstraps (see docker-entrypoint.sh: initdb runs only if PGDATA is empty). Switching to a bind mount or ephemeral volume breaks the production path; switching to POSTGRES_PASSWORD_FILE + ALTER ROLE adds operator surface without eliminating the divergence. The ergonomic fix is to surface the failure mode loudly, with both remediation paths, at the exact log line where it becomes visible.

Two remediation paths, surfaced together.

Destructive: `docker compose -f deploy/docker-compose.yml down -v && up -d --build` — wipes the postgres volume so initdb re-runs with the new env value. Use this on demos / first-time setup where data loss is acceptable.

Non-destructive: `docker compose exec postgres psql -U certctl -c "ALTER ROLE certctl PASSWORD '<new>';"` followed by a server restart with the matching POSTGRES_PASSWORD. Use this on any environment that holds data you want to keep.

Surfacing both means the operator can pick based on their environment without us assuming.

Files changed:
- internal/repository/postgres/db.go — extract wrapPingError(err) helper. errors.As against pq.Error; on SQLSTATE 28P01 (invalid_password) emit the multi-line guidance preserving the %w wrap chain. Non-28P01 errors retain the original `failed to ping database: %w` shape so transient connection-refused / timeout paths don't get noisy. Add pgErrInvalidPassword = "28P01" constant. Convert blank `_ "github.com/lib/pq"` import to direct import (driver registration still works via init()) so we can name the pq.Error type at compile time. NewDB now calls wrapPingError(err) instead of inlining the wrap.
- internal/repository/postgres/db_test.go (new) — 4 internal-package unit tests covering wrapPingError. AuthFailureGuidance pins the contract substrings ("SQLSTATE 28P01", "POSTGRES_PASSWORD", "first boot", "down -v", "ALTER ROLE"). NonAuthErrorPreservesOriginalWrap pins the no-leak contract for SQLSTATE 08006 (connection_failure). NonPqErrorPreservesOriginalWrap pins the network-level path. NilReturnsNil pins defensive contract. All run in -short without testcontainers — package postgres (internal) so the unexported helper is callable directly.
- docs/quickstart.md — `> Warning:` callout immediately after the `cp deploy/.env.example deploy/.env` block at lines 56-61. Names the trap, names the SQLSTATE, gives both remediation paths. Uses the in-file `> Note:` blockquote convention.
- deploy/ENVIRONMENTS.md — `Stateful volume — first-boot password binding (U-1)` paragraph appended to the Postgres expert-note block. Explains the env-vs-pg_authid divergence, points at wrapPingError as the runtime diagnostic, lists both remediation paths. Uses the in-file `Expert note:` convention.

Out of scope (separate follow-ups):
- deploy/helm/certctl/templates/postgres-statefulset.yaml has the same root cause via PVC retention. The wrapPingError diagnostic covers the Helm path because the same NewDB code runs at server startup; the Helm-specific doc warning lands separately.
- /.env.example at repo root (line 16 hardcodes the password literally inside CERTCTL_DATABASE_URL rather than interpolating) — adjacent trap, separate fix.
- examples/{acme-nginx,private-ca-traefik,step-ca-haproxy,multi-issuer,acme-wildcard-dns01}/docker-compose.yml all carry the pattern. The diagnostic covers them; targeted doc warnings are scoped to the canonical quickstart + ENVIRONMENTS docs.

Out of consideration:
- Switch to bind mount / ephemeral volume — breaks the production path.
- POSTGRES_PASSWORD_FILE + Docker secret + ALTER ROLE rotation — adds operator surface without fixing the env-vs-pg_authid divergence.

Verification (all passing):
- go build ./...
- go vet ./...
- go test -short -race ./internal/repository/postgres/ — 4/4 new tests pass plus existing tests
- go test -short ./... — every package green
- govulncheck ./... — no vulnerabilities in our code
- wrapPingError coverage 100%; postgres pkg total unchanged in shape (NewDB/RunMigrations were 0% pre-fix, still 0% post-fix; new helper adds 100%-covered statements)

Refs:
coverage-gap-audit-2026-04-24-v5/unified-audit.md §2 P1 cluster, cat-u-quickstart_postgres_password_volume_trap
GitHub Issue #10 (mikeakasully)	2026-04-24 23:21:26 +00:00
shankar0123	8054719956	fix: migration runner only executes .up.sql files, skips .down.sql and seeds The migration runner was collecting all .sql files alphabetically, which caused .down.sql rollback files (DROP TABLE) to execute before .up.sql files on restart with a persisted postgres volume. Filter to only .up.sql files — these are idempotent (IF NOT EXISTS) and safe to re-run. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 12:08:12 -04:00
shankar0123	3a9fe8ba37	Complete V1 scaffold	2026-03-14 20:01:53 -04:00

4 Commits