mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 13:31:36 +00:00
a91197014f
The shipped quickstart instructs operators to copy deploy/.env.example to
deploy/.env, edit POSTGRES_PASSWORD, and run docker compose up. On the
*first* boot of a fresh checkout this works. On the *second* boot — i.e.,
when an operator first booted with the default POSTGRES_PASSWORD=certctl,
then edited .env and re-ran up — the certctl-server container picks up the
new password (env interpolated at every container start) but postgres does
not. The postgres docker-entrypoint runs initdb only when the data dir is
empty; on subsequent boots the persistent named volume postgres_data is
non-empty so pg_authid retains the password baked in on first boot. The
server connects with the new credentials, postgres rejects them, and the
operator sees an opaque `pq: password authentication failed for user
"certctl"` in the server log with no pointer to the actual cause. New-
operator onboarding gets blocked on the documented production path.
Why a doc fix alone is not sufficient. Operators don't reread the docs
after a successful first boot — the trap fires on the *second* up, when
they think they've already learned the system. The opaque pq error is
indistinguishable in the log from a typo'd password or a misconfigured
secret store. The diagnostic has to fire at the moment the failure is
observed.
Why we don't try to fix the bootstrap. The env-vs-pg_authid divergence is
intrinsic to how the official postgres image bootstraps (see
docker-entrypoint.sh: initdb runs only if PGDATA is empty). Switching to a
bind mount or ephemeral volume breaks the production path; switching to
POSTGRES_PASSWORD_FILE + ALTER ROLE adds operator surface without
eliminating the divergence. The ergonomic fix is to surface the failure
mode loudly, with both remediation paths, at the exact log line where it
becomes visible.
Two remediation paths, surfaced together. Destructive: `docker compose
-f deploy/docker-compose.yml down -v && up -d --build` — wipes the
postgres volume so initdb re-runs with the new env value. Use this on
demos / first-time setup where data loss is acceptable. Non-destructive:
`docker compose exec postgres psql -U certctl -c "ALTER ROLE certctl
PASSWORD '<new>';"` followed by a server restart with the matching
POSTGRES_PASSWORD. Use this on any environment that holds data you want
to keep. Surfacing both means the operator can pick based on their
environment without us assuming.
Files changed:
- internal/repository/postgres/db.go — extract wrapPingError(err) helper.
errors.As against *pq.Error; on SQLSTATE 28P01 (invalid_password) emit
the multi-line guidance preserving the %w wrap chain. Non-28P01 errors
retain the original `failed to ping database: %w` shape so transient
connection-refused / timeout paths don't get noisy. Add
pgErrInvalidPassword = "28P01" constant. Convert blank
`_ "github.com/lib/pq"` import to direct import (driver registration
still works via init()) so we can name the *pq.Error type at compile
time. NewDB now calls wrapPingError(err) instead of inlining the wrap.
- internal/repository/postgres/db_test.go (new) — 4 internal-package
unit tests covering wrapPingError. AuthFailureGuidance pins the
contract substrings ("SQLSTATE 28P01", "POSTGRES_PASSWORD",
"first boot", "down -v", "ALTER ROLE"). NonAuthErrorPreservesOriginalWrap
pins the no-leak contract for SQLSTATE 08006 (connection_failure).
NonPqErrorPreservesOriginalWrap pins the network-level path.
NilReturnsNil pins defensive contract. All run in -short without
testcontainers — package postgres (internal) so the unexported helper
is callable directly.
- docs/quickstart.md — `> **Warning:**` callout immediately after the
`cp deploy/.env.example deploy/.env` block at lines 56-61. Names the
trap, names the SQLSTATE, gives both remediation paths. Uses the
in-file `> **Note:**` blockquote convention.
- deploy/ENVIRONMENTS.md — `**Stateful volume — first-boot password
binding (U-1)**` paragraph appended to the Postgres expert-note block.
Explains the env-vs-pg_authid divergence, points at wrapPingError as
the runtime diagnostic, lists both remediation paths. Uses the in-file
`**Expert note:**` convention.
Out of scope (separate follow-ups):
- deploy/helm/certctl/templates/postgres-statefulset.yaml has the same
root cause via PVC retention. The wrapPingError diagnostic covers the
Helm path because the same NewDB code runs at server startup; the
Helm-specific doc warning lands separately.
- /.env.example at repo root (line 16 hardcodes the password literally
inside CERTCTL_DATABASE_URL rather than interpolating) — adjacent
trap, separate fix.
- examples/{acme-nginx,private-ca-traefik,step-ca-haproxy,multi-issuer,
acme-wildcard-dns01}/docker-compose.yml all carry the pattern. The
diagnostic covers them; targeted doc warnings are scoped to the
canonical quickstart + ENVIRONMENTS docs.
Out of consideration:
- Switch to bind mount / ephemeral volume — breaks the production path.
- POSTGRES_PASSWORD_FILE + Docker secret + ALTER ROLE rotation — adds
operator surface without fixing the env-vs-pg_authid divergence.
Verification (all passing):
- go build ./...
- go vet ./...
- go test -short -race ./internal/repository/postgres/ — 4/4 new tests
pass plus existing tests
- go test -short ./... — every package green
- govulncheck ./... — no vulnerabilities in our code
- wrapPingError coverage 100%; postgres pkg total unchanged in shape
(NewDB/RunMigrations were 0% pre-fix, still 0% post-fix; new helper
adds 100%-covered statements)
Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
§2 P1 cluster, cat-u-quickstart_postgres_password_volume_trap
GitHub Issue #10 (mikeakasully)
134 lines
5.1 KiB
Go
134 lines
5.1 KiB
Go
package postgres
|
|
|
|
import (
|
|
"database/sql"
|
|
"errors"
|
|
"fmt"
|
|
"os"
|
|
"path/filepath"
|
|
"strings"
|
|
|
|
"github.com/lib/pq"
|
|
)
|
|
|
|
// pgErrInvalidPassword is the SQLSTATE for class 28 / code 28P01 —
|
|
// invalid_password — emitted by PostgreSQL when the client presents
|
|
// credentials that don't match pg_authid. Defined locally because the
|
|
// lib/pq package does not export named constants for SQLSTATE codes (it
|
|
// only exposes the typed string alias pq.ErrorCode and a name-lookup map
|
|
// at runtime). Pinned as a string constant rather than a pq.ErrorCode
|
|
// literal so the contract is grep-able from operator-facing log lines.
|
|
//
|
|
// Reference: https://www.postgresql.org/docs/16/errcodes-appendix.html
|
|
const pgErrInvalidPassword = "28P01"
|
|
|
|
// NewDB opens a PostgreSQL database connection and sets up connection pooling.
|
|
func NewDB(connStr string) (*sql.DB, error) {
|
|
db, err := sql.Open("postgres", connStr)
|
|
if err != nil {
|
|
return nil, fmt.Errorf("failed to open database: %w", err)
|
|
}
|
|
|
|
// Configure connection pool
|
|
db.SetMaxOpenConns(25)
|
|
db.SetMaxIdleConns(5)
|
|
|
|
// Ping to verify connection
|
|
if err := db.Ping(); err != nil {
|
|
return nil, wrapPingError(err)
|
|
}
|
|
|
|
return db, nil
|
|
}
|
|
|
|
// wrapPingError converts a db.Ping() failure into an operator-friendly
|
|
// diagnostic. The default wrap is the original opaque
|
|
// `"failed to ping database: <inner>"` shape. The exception is SQLSTATE 28P01
|
|
// (invalid_password): when postgres rejects the server's credentials we emit
|
|
// extended guidance that names the most common operator misstep — editing
|
|
// POSTGRES_PASSWORD in `.env` after the postgres named volume has already
|
|
// been initialized — and lists both the destructive (`docker compose down -v`)
|
|
// and non-destructive (`ALTER ROLE`) remediations.
|
|
//
|
|
// U-1 (P1, GitHub #10): closes the audit-flagged
|
|
// cat-u-quickstart_postgres_password_volume_trap finding. The postgres
|
|
// docker-entrypoint runs initdb only when /var/lib/postgresql/data is empty;
|
|
// on subsequent boots the password baked into pg_authid on first boot wins
|
|
// over whatever the env var carries, so the env-vs-pg_authid divergence is
|
|
// intrinsic to how the postgres image bootstraps and cannot be fixed by us
|
|
// upstream of pg_authid. The ergonomic answer is to surface a clear
|
|
// diagnostic at the failure site so operators don't waste an hour on
|
|
// "is my password right" before discovering the volume needs to be torn
|
|
// down (or the role's password rotated in-place).
|
|
//
|
|
// The wrap chain is preserved via fmt.Errorf("%w", err) so callers using
|
|
// errors.As(err, &*pq.Error) on the returned value continue to work; this
|
|
// matches the audit's "no substring matching on err.Error()" requirement
|
|
// from the M-1 sentinel-error migration.
|
|
//
|
|
// Returns nil when err is nil so callers can defensively pipe through this
|
|
// helper without an extra branch.
|
|
func wrapPingError(err error) error {
|
|
if err == nil {
|
|
return nil
|
|
}
|
|
|
|
var pqErr *pq.Error
|
|
if errors.As(err, &pqErr) && string(pqErr.Code) == pgErrInvalidPassword {
|
|
return fmt.Errorf(
|
|
"failed to ping database: postgres rejected the configured credentials "+
|
|
"(SQLSTATE %s — invalid_password). If you recently rotated POSTGRES_PASSWORD "+
|
|
"on a docker-compose deploy, the postgres container's data volume still "+
|
|
"holds the previous password: initdb seeds POSTGRES_PASSWORD into pg_authid "+
|
|
"only on first boot of a fresh data dir, so editing the env var after that "+
|
|
"point updates only the certctl-server container. Reset destructively with "+
|
|
"`docker compose -f deploy/docker-compose.yml down -v && "+
|
|
"docker compose -f deploy/docker-compose.yml up -d --build` (this DESTROYS "+
|
|
"all data in the postgres volume), or non-destructively with "+
|
|
"`docker compose -f deploy/docker-compose.yml exec postgres "+
|
|
"psql -U certctl -c \"ALTER ROLE certctl PASSWORD '<new-password>';\"` "+
|
|
"and then redeploy with the matching POSTGRES_PASSWORD. Underlying error: %w",
|
|
pgErrInvalidPassword, err)
|
|
}
|
|
|
|
return fmt.Errorf("failed to ping database: %w", err)
|
|
}
|
|
|
|
// RunMigrations reads and executes SQL migration files from a directory.
|
|
func RunMigrations(db *sql.DB, migrationsPath string) error {
|
|
// Check if migrations directory exists
|
|
if _, err := os.Stat(migrationsPath); os.IsNotExist(err) {
|
|
return fmt.Errorf("migrations directory not found: %s", migrationsPath)
|
|
}
|
|
|
|
// Read all SQL files from the migrations directory
|
|
files, err := os.ReadDir(migrationsPath)
|
|
if err != nil {
|
|
return fmt.Errorf("failed to read migrations directory: %w", err)
|
|
}
|
|
|
|
// Sort and filter to only .up.sql migration files (skip .down.sql rollbacks and seed files)
|
|
var sqlFiles []string
|
|
for _, file := range files {
|
|
if !file.IsDir() && strings.HasSuffix(file.Name(), ".up.sql") {
|
|
sqlFiles = append(sqlFiles, file.Name())
|
|
}
|
|
}
|
|
|
|
// Execute each migration file in order
|
|
for _, filename := range sqlFiles {
|
|
filePath := filepath.Join(migrationsPath, filename)
|
|
content, err := os.ReadFile(filePath)
|
|
if err != nil {
|
|
return fmt.Errorf("failed to read migration file %s: %w", filename, err)
|
|
}
|
|
|
|
// Execute the SQL content
|
|
if _, err := db.Exec(string(content)); err != nil {
|
|
return fmt.Errorf("failed to execute migration %s: %w", filename, err)
|
|
}
|
|
}
|
|
|
|
return nil
|
|
}
|