mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 18:51:32 +00:00
a91197014f
The shipped quickstart instructs operators to copy deploy/.env.example to
deploy/.env, edit POSTGRES_PASSWORD, and run docker compose up. On the
*first* boot of a fresh checkout this works. On the *second* boot — i.e.,
when an operator first booted with the default POSTGRES_PASSWORD=certctl,
then edited .env and re-ran up — the certctl-server container picks up the
new password (env interpolated at every container start) but postgres does
not. The postgres docker-entrypoint runs initdb only when the data dir is
empty; on subsequent boots the persistent named volume postgres_data is
non-empty so pg_authid retains the password baked in on first boot. The
server connects with the new credentials, postgres rejects them, and the
operator sees an opaque `pq: password authentication failed for user
"certctl"` in the server log with no pointer to the actual cause. New-
operator onboarding gets blocked on the documented production path.
Why a doc fix alone is not sufficient. Operators don't reread the docs
after a successful first boot — the trap fires on the *second* up, when
they think they've already learned the system. The opaque pq error is
indistinguishable in the log from a typo'd password or a misconfigured
secret store. The diagnostic has to fire at the moment the failure is
observed.
Why we don't try to fix the bootstrap. The env-vs-pg_authid divergence is
intrinsic to how the official postgres image bootstraps (see
docker-entrypoint.sh: initdb runs only if PGDATA is empty). Switching to a
bind mount or ephemeral volume breaks the production path; switching to
POSTGRES_PASSWORD_FILE + ALTER ROLE adds operator surface without
eliminating the divergence. The ergonomic fix is to surface the failure
mode loudly, with both remediation paths, at the exact log line where it
becomes visible.
Two remediation paths, surfaced together. Destructive: `docker compose
-f deploy/docker-compose.yml down -v && up -d --build` — wipes the
postgres volume so initdb re-runs with the new env value. Use this on
demos / first-time setup where data loss is acceptable. Non-destructive:
`docker compose exec postgres psql -U certctl -c "ALTER ROLE certctl
PASSWORD '<new>';"` followed by a server restart with the matching
POSTGRES_PASSWORD. Use this on any environment that holds data you want
to keep. Surfacing both means the operator can pick based on their
environment without us assuming.
Files changed:
- internal/repository/postgres/db.go — extract wrapPingError(err) helper.
errors.As against *pq.Error; on SQLSTATE 28P01 (invalid_password) emit
the multi-line guidance preserving the %w wrap chain. Non-28P01 errors
retain the original `failed to ping database: %w` shape so transient
connection-refused / timeout paths don't get noisy. Add
pgErrInvalidPassword = "28P01" constant. Convert blank
`_ "github.com/lib/pq"` import to direct import (driver registration
still works via init()) so we can name the *pq.Error type at compile
time. NewDB now calls wrapPingError(err) instead of inlining the wrap.
- internal/repository/postgres/db_test.go (new) — 4 internal-package
unit tests covering wrapPingError. AuthFailureGuidance pins the
contract substrings ("SQLSTATE 28P01", "POSTGRES_PASSWORD",
"first boot", "down -v", "ALTER ROLE"). NonAuthErrorPreservesOriginalWrap
pins the no-leak contract for SQLSTATE 08006 (connection_failure).
NonPqErrorPreservesOriginalWrap pins the network-level path.
NilReturnsNil pins defensive contract. All run in -short without
testcontainers — package postgres (internal) so the unexported helper
is callable directly.
- docs/quickstart.md — `> **Warning:**` callout immediately after the
`cp deploy/.env.example deploy/.env` block at lines 56-61. Names the
trap, names the SQLSTATE, gives both remediation paths. Uses the
in-file `> **Note:**` blockquote convention.
- deploy/ENVIRONMENTS.md — `**Stateful volume — first-boot password
binding (U-1)**` paragraph appended to the Postgres expert-note block.
Explains the env-vs-pg_authid divergence, points at wrapPingError as
the runtime diagnostic, lists both remediation paths. Uses the in-file
`**Expert note:**` convention.
Out of scope (separate follow-ups):
- deploy/helm/certctl/templates/postgres-statefulset.yaml has the same
root cause via PVC retention. The wrapPingError diagnostic covers the
Helm path because the same NewDB code runs at server startup; the
Helm-specific doc warning lands separately.
- /.env.example at repo root (line 16 hardcodes the password literally
inside CERTCTL_DATABASE_URL rather than interpolating) — adjacent
trap, separate fix.
- examples/{acme-nginx,private-ca-traefik,step-ca-haproxy,multi-issuer,
acme-wildcard-dns01}/docker-compose.yml all carry the pattern. The
diagnostic covers them; targeted doc warnings are scoped to the
canonical quickstart + ENVIRONMENTS docs.
Out of consideration:
- Switch to bind mount / ephemeral volume — breaks the production path.
- POSTGRES_PASSWORD_FILE + Docker secret + ALTER ROLE rotation — adds
operator surface without fixing the env-vs-pg_authid divergence.
Verification (all passing):
- go build ./...
- go vet ./...
- go test -short -race ./internal/repository/postgres/ — 4/4 new tests
pass plus existing tests
- go test -short ./... — every package green
- govulncheck ./... — no vulnerabilities in our code
- wrapPingError coverage 100%; postgres pkg total unchanged in shape
(NewDB/RunMigrations were 0% pre-fix, still 0% post-fix; new helper
adds 100%-covered statements)
Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
§2 P1 cluster, cat-u-quickstart_postgres_password_volume_trap
GitHub Issue #10 (mikeakasully)
169 lines
6.4 KiB
Go
169 lines
6.4 KiB
Go
// Internal-package tests for db.go — covers the diagnostic dispatch in
|
|
// wrapPingError. Lives in `package postgres` (not `postgres_test`) so it can
|
|
// call the unexported helper directly without exposing it on the API surface.
|
|
//
|
|
// Sibling integration tests in this directory live in `package postgres_test`
|
|
// (testcontainers-driven, schema-per-test). They exercise the live-DB
|
|
// happy path; this file owns the unit-level diagnostic dispatch and runs in
|
|
// `-short` mode without spinning up postgres.
|
|
//
|
|
// U-1 (P1, GitHub #10): closes the audit-flagged
|
|
// cat-u-quickstart_postgres_password_volume_trap finding by pinning the
|
|
// post-fix wrap-text contract for `db.Ping()` failures. Pre-U-1 every Ping
|
|
// error was wrapped with the same opaque `"failed to ping database: %w"`,
|
|
// so an operator who edited POSTGRES_PASSWORD after first-boot saw only
|
|
// `pq: password authentication failed for user "certctl"` in the server
|
|
// log with no pointer to the actual cause (postgres data dir retains the
|
|
// initial password from first-boot initdb; subsequent boots ignore the env
|
|
// var). Post-U-1 the SQLSTATE-28P01 path emits a multi-line diagnostic
|
|
// pointing at the down -v / ALTER ROLE remediation; non-auth failures
|
|
// retain the original wrap shape so verbose noise does not bleed into
|
|
// transient connection-refused / timeout paths.
|
|
package postgres
|
|
|
|
import (
|
|
"errors"
|
|
"strings"
|
|
"testing"
|
|
|
|
"github.com/lib/pq"
|
|
)
|
|
|
|
// TestWrapPingError_AuthFailureGuidance asserts the diagnostic wrap fires on
|
|
// SQLSTATE 28P01 (invalid_password) and contains all three contract elements:
|
|
// the SQLSTATE code (so operators can grep), the down-v destructive
|
|
// remediation, and the ALTER ROLE non-destructive remediation. Also asserts
|
|
// the wrap chain still satisfies errors.As(err, &*pq.Error) so callers that
|
|
// programmatically inspect the underlying postgres error code keep working.
|
|
func TestWrapPingError_AuthFailureGuidance(t *testing.T) {
|
|
t.Parallel()
|
|
|
|
original := &pq.Error{
|
|
Code: pq.ErrorCode("28P01"),
|
|
Message: `password authentication failed for user "certctl"`,
|
|
}
|
|
|
|
wrapped := wrapPingError(original)
|
|
if wrapped == nil {
|
|
t.Fatal("wrapPingError returned nil for a non-nil input")
|
|
}
|
|
|
|
got := wrapped.Error()
|
|
|
|
// Contract elements — the operator-facing string is what we ship.
|
|
wantSubstrings := []string{
|
|
"SQLSTATE 28P01", // operators grep on this
|
|
"POSTGRES_PASSWORD", // names the variable that traps
|
|
"first boot", // the mechanism in plain language
|
|
"down -v", // destructive remediation
|
|
"ALTER ROLE", // non-destructive remediation
|
|
}
|
|
for _, s := range wantSubstrings {
|
|
if !strings.Contains(got, s) {
|
|
t.Errorf("wrap text missing %q\ngot: %s", s, got)
|
|
}
|
|
}
|
|
|
|
// Wrap chain must still expose the underlying *pq.Error for callers
|
|
// that want to inspect Code / Detail / Constraint fields. Pre-fix
|
|
// callers used errors.As(err, &pqErr) on the unwrapped Ping result;
|
|
// the new wrap is fmt.Errorf("...%w", err) so errors.As must walk it.
|
|
var pqErr *pq.Error
|
|
if !errors.As(wrapped, &pqErr) {
|
|
t.Fatalf("errors.As did not extract *pq.Error from wrapped chain: %v", wrapped)
|
|
}
|
|
if pqErr.Code != "28P01" {
|
|
t.Errorf("extracted pq.Error.Code = %q, want %q", pqErr.Code, "28P01")
|
|
}
|
|
}
|
|
|
|
// TestWrapPingError_NonAuthErrorPreservesOriginalWrap guards against the
|
|
// guidance text bleeding into unrelated failure modes. SQLSTATE 08006
|
|
// (connection_failure) is the canonical non-auth case — server unreachable,
|
|
// TLS handshake failure, network drop. The wrap should be the original
|
|
// shape so transient-error log noise does not include the (now lengthy)
|
|
// volume-state remediation paragraph.
|
|
func TestWrapPingError_NonAuthErrorPreservesOriginalWrap(t *testing.T) {
|
|
t.Parallel()
|
|
|
|
original := &pq.Error{
|
|
Code: pq.ErrorCode("08006"),
|
|
Message: "connection refused",
|
|
}
|
|
|
|
wrapped := wrapPingError(original)
|
|
if wrapped == nil {
|
|
t.Fatal("wrapPingError returned nil for a non-nil input")
|
|
}
|
|
|
|
got := wrapped.Error()
|
|
|
|
// Original-wrap shape: prefix only, no guidance text.
|
|
const wantPrefix = "failed to ping database: "
|
|
if !strings.HasPrefix(got, wantPrefix) {
|
|
t.Errorf("expected prefix %q, got: %s", wantPrefix, got)
|
|
}
|
|
|
|
// Negative assertions: guidance text MUST NOT appear on non-auth paths.
|
|
mustNotContain := []string{
|
|
"SQLSTATE 08006", // we only call out 28P01 specifically
|
|
"POSTGRES_PASSWORD",
|
|
"down -v",
|
|
"ALTER ROLE",
|
|
}
|
|
for _, s := range mustNotContain {
|
|
if strings.Contains(got, s) {
|
|
t.Errorf("non-auth wrap leaked guidance substring %q\ngot: %s", s, got)
|
|
}
|
|
}
|
|
|
|
// Wrap chain still walks for errors.As — same contract as auth path.
|
|
var pqErr *pq.Error
|
|
if !errors.As(wrapped, &pqErr) {
|
|
t.Fatalf("errors.As did not extract *pq.Error from non-auth wrapped chain: %v", wrapped)
|
|
}
|
|
if pqErr.Code != "08006" {
|
|
t.Errorf("extracted pq.Error.Code = %q, want %q", pqErr.Code, "08006")
|
|
}
|
|
}
|
|
|
|
// TestWrapPingError_NonPqErrorPreservesOriginalWrap guards the network-level
|
|
// case: a pre-handshake failure (TCP refused, DNS, TLS) returns a
|
|
// non-*pq.Error from db.Ping(). errors.As must return false, the helper
|
|
// must fall through to the generic wrap, and the chain must remain walkable.
|
|
func TestWrapPingError_NonPqErrorPreservesOriginalWrap(t *testing.T) {
|
|
t.Parallel()
|
|
|
|
original := errors.New("dial tcp 127.0.0.1:5432: connect: connection refused")
|
|
|
|
wrapped := wrapPingError(original)
|
|
if wrapped == nil {
|
|
t.Fatal("wrapPingError returned nil for a non-nil input")
|
|
}
|
|
|
|
got := wrapped.Error()
|
|
|
|
const wantPrefix = "failed to ping database: "
|
|
if !strings.HasPrefix(got, wantPrefix) {
|
|
t.Errorf("expected prefix %q, got: %s", wantPrefix, got)
|
|
}
|
|
if strings.Contains(got, "SQLSTATE") || strings.Contains(got, "POSTGRES_PASSWORD") {
|
|
t.Errorf("network-level wrap leaked SQLSTATE/postgres guidance\ngot: %s", got)
|
|
}
|
|
if !errors.Is(wrapped, original) {
|
|
t.Errorf("errors.Is did not walk to original sentinel: %v", wrapped)
|
|
}
|
|
}
|
|
|
|
// TestWrapPingError_NilReturnsNil — defensive contract: if Ping returned nil
|
|
// (no failure), the helper must not synthesize a fake error. This isn't on
|
|
// the documented call path (NewDB only invokes wrapPingError inside the
|
|
// `if err != nil` branch), but pinning it prevents a future refactor from
|
|
// regressing the contract silently.
|
|
func TestWrapPingError_NilReturnsNil(t *testing.T) {
|
|
t.Parallel()
|
|
if got := wrapPingError(nil); got != nil {
|
|
t.Errorf("wrapPingError(nil) = %v, want nil", got)
|
|
}
|
|
}
|