The shipped quickstart instructs operators to copy deploy/.env.example to
deploy/.env, edit POSTGRES_PASSWORD, and run docker compose up. On the
*first* boot of a fresh checkout this works. On the *second* boot — i.e.,
when an operator first booted with the default POSTGRES_PASSWORD=certctl,
then edited .env and re-ran up — the certctl-server container picks up the
new password (env interpolated at every container start) but postgres does
not. The postgres docker-entrypoint runs initdb only when the data dir is
empty; on subsequent boots the persistent named volume postgres_data is
non-empty so pg_authid retains the password baked in on first boot. The
server connects with the new credentials, postgres rejects them, and the
operator sees an opaque `pq: password authentication failed for user
"certctl"` in the server log with no pointer to the actual cause. New-
operator onboarding gets blocked on the documented production path.
Why a doc fix alone is not sufficient. Operators don't reread the docs
after a successful first boot — the trap fires on the *second* up, when
they think they've already learned the system. The opaque pq error is
indistinguishable in the log from a typo'd password or a misconfigured
secret store. The diagnostic has to fire at the moment the failure is
observed.
Why we don't try to fix the bootstrap. The env-vs-pg_authid divergence is
intrinsic to how the official postgres image bootstraps (see
docker-entrypoint.sh: initdb runs only if PGDATA is empty). Switching to a
bind mount or ephemeral volume breaks the production path; switching to
POSTGRES_PASSWORD_FILE + ALTER ROLE adds operator surface without
eliminating the divergence. The ergonomic fix is to surface the failure
mode loudly, with both remediation paths, at the exact log line where it
becomes visible.
Two remediation paths, surfaced together. Destructive: `docker compose
-f deploy/docker-compose.yml down -v && up -d --build` — wipes the
postgres volume so initdb re-runs with the new env value. Use this on
demos / first-time setup where data loss is acceptable. Non-destructive:
`docker compose exec postgres psql -U certctl -c "ALTER ROLE certctl
PASSWORD '<new>';"` followed by a server restart with the matching
POSTGRES_PASSWORD. Use this on any environment that holds data you want
to keep. Surfacing both means the operator can pick based on their
environment without us assuming.
Files changed:
- internal/repository/postgres/db.go — extract wrapPingError(err) helper.
errors.As against *pq.Error; on SQLSTATE 28P01 (invalid_password) emit
the multi-line guidance preserving the %w wrap chain. Non-28P01 errors
retain the original `failed to ping database: %w` shape so transient
connection-refused / timeout paths don't get noisy. Add
pgErrInvalidPassword = "28P01" constant. Convert blank
`_ "github.com/lib/pq"` import to direct import (driver registration
still works via init()) so we can name the *pq.Error type at compile
time. NewDB now calls wrapPingError(err) instead of inlining the wrap.
- internal/repository/postgres/db_test.go (new) — 4 internal-package
unit tests covering wrapPingError. AuthFailureGuidance pins the
contract substrings ("SQLSTATE 28P01", "POSTGRES_PASSWORD",
"first boot", "down -v", "ALTER ROLE"). NonAuthErrorPreservesOriginalWrap
pins the no-leak contract for SQLSTATE 08006 (connection_failure).
NonPqErrorPreservesOriginalWrap pins the network-level path.
NilReturnsNil pins defensive contract. All run in -short without
testcontainers — package postgres (internal) so the unexported helper
is callable directly.
- docs/quickstart.md — `> **Warning:**` callout immediately after the
`cp deploy/.env.example deploy/.env` block at lines 56-61. Names the
trap, names the SQLSTATE, gives both remediation paths. Uses the
in-file `> **Note:**` blockquote convention.
- deploy/ENVIRONMENTS.md — `**Stateful volume — first-boot password
binding (U-1)**` paragraph appended to the Postgres expert-note block.
Explains the env-vs-pg_authid divergence, points at wrapPingError as
the runtime diagnostic, lists both remediation paths. Uses the in-file
`**Expert note:**` convention.
Out of scope (separate follow-ups):
- deploy/helm/certctl/templates/postgres-statefulset.yaml has the same
root cause via PVC retention. The wrapPingError diagnostic covers the
Helm path because the same NewDB code runs at server startup; the
Helm-specific doc warning lands separately.
- /.env.example at repo root (line 16 hardcodes the password literally
inside CERTCTL_DATABASE_URL rather than interpolating) — adjacent
trap, separate fix.
- examples/{acme-nginx,private-ca-traefik,step-ca-haproxy,multi-issuer,
acme-wildcard-dns01}/docker-compose.yml all carry the pattern. The
diagnostic covers them; targeted doc warnings are scoped to the
canonical quickstart + ENVIRONMENTS docs.
Out of consideration:
- Switch to bind mount / ephemeral volume — breaks the production path.
- POSTGRES_PASSWORD_FILE + Docker secret + ALTER ROLE rotation — adds
operator surface without fixing the env-vs-pg_authid divergence.
Verification (all passing):
- go build ./...
- go vet ./...
- go test -short -race ./internal/repository/postgres/ — 4/4 new tests
pass plus existing tests
- go test -short ./... — every package green
- govulncheck ./... — no vulnerabilities in our code
- wrapPingError coverage 100%; postgres pkg total unchanged in shape
(NewDB/RunMigrations were 0% pre-fix, still 0% post-fix; new helper
adds 100%-covered statements)
Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
§2 P1 cluster, cat-u-quickstart_postgres_password_volume_trap
GitHub Issue #10 (mikeakasully)
Breaking change release. Plaintext HTTP listener removed. The certctl
control plane now terminates TLS 1.3 on :8443 via
http.Server.ListenAndServeTLS. No CERTCTL_TLS_ENABLED=false escape
hatch. No dual-listener mode. One-step cutover per docs/upgrade-to-tls.md.
Server
- cmd/server/tls.go: certHolder with SIGHUP hot-reload + atomic cert
swap, buildServerTLSConfig (TLS 1.3 min, GetCertificate callback),
preflightServerTLS validation
- cmd/server/main.go: ListenAndServeTLS in place of ListenAndServe,
watchSIGHUP wiring, cert/key path config threading
- tls_test.go: 418-line regression coverage of reload, preflight,
callback behavior, SAN validation
Config
- CERTCTL_TLS_CERT_PATH / CERTCTL_TLS_KEY_PATH (required)
- Plaintext rejection: agents/CLI/MCP pre-flight-fail on http://
URLs with a pointer to docs/upgrade-to-tls.md
Agents, CLI, MCP
- All three pre-flight-reject http:// URLs with fail-loud diagnostic
- CERTCTL_SERVER_CA_BUNDLE_PATH for private-CA trust
- CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY for dev-only bypass
(loud warning on startup)
- install-agent.sh emits both vars as commented template lines
docker-compose
- certctl-tls-init sidecar generates SAN-valid self-signed cert into
deploy/test/certs/ on first boot
- All demo-stack curls pin against ca.crt with --cacert
Helm chart
- Three TLS provisioning modes, exactly one required:
- server.tls.existingSecret (operator-supplied)
- server.tls.certManager.enabled (cert-manager integration)
- server.tls.selfSigned.enabled (eval only — not for production)
- server-certificate.yaml template for cert-manager mode
- helm install without a TLS source fails at template render with
a pointer to docs/tls.md
CI
- .github/workflows/ci.yml Helm Chart Validation step renders the
chart in both existingSecret and cert-manager modes, plus an
inverse guard-regression test that asserts helm template MUST
refuse to render when no TLS source is configured. Previously
the single `helm template` invocation hit the certctl.tls.required
fail-loud guard and exit-1'd CI. Four invocations now: lint
(existingSecret), template (existingSecret), template
(cert-manager), template (no args — must fail).
Integration tests
- deploy/test/integration_test.go stands up the Compose stack over
HTTPS, extracts the CA bundle, and exercises every certctl API
over https://localhost:8443
- All 34 integration subtests green (per Phase 8 local CI-parity)
Documentation
- New: docs/tls.md (provisioning patterns, rotation, SIGHUP reload)
- New: docs/upgrade-to-tls.md (one-step cutover, no-downgrade
warnings, fleet-roll sequencing)
- CHANGELOG.md: v2.2.0 "HTTPS Everywhere — The Irony" entry
(file heading unchanged; release tag is v2.0.47)
- All curls in docs/, examples/, deploy/helm/ guides use
https://localhost:8443 --cacert
Verification
- grep -rn "ListenAndServe[^T]" cmd/ internal/ → 0 hits
- grep -rn "\"http://" cmd/ internal/ → 2 benign hits (Caddy admin
API default, SSRF doc comment) — zero certctl endpoints
- Tasks #197–#206 (Phases 0–8) all closed in the tracker
Files: 65 changed, 3489 insertions, 372 deletions (pre-CI-fix).
Closes the remaining P1 gaps from coverage-gap-audit.md (M-001/M-002/M-003/M-006)
on top of the C-001/C-002 ownership + agent-FK contract fixes landed in
5c01c7f. The work lands as a single commit spanning server, docs, tests,
and the React client.
M-002 — Named API keys with per-key actor propagation
* Migration 000014 adds the 'api_keys' table (id, name, hash,
principal, role, created_at, last_used_at, disabled_at) so every
credential carries an identifiable principal instead of the
opaque 'anonymous'/'api-key' sentinel.
* Auth middleware now rotates through configured keys, performs
constant-time hash comparison, stamps 'last_used_at', and emits
an actor struct via contextWithActor(). The audit middleware,
bulk-revocation handler, approval handlers, and MCP tool layer
now read the principal off the context and persist it on every
audit_events row.
* Regression coverage:
- internal/api/middleware/audit_test.go — actor propagation,
principal redaction for disabled keys, anonymous fallback for
unauthenticated endpoints.
- internal/api/handler/bulk_revocation_handler_test.go,
job_handler_test.go — principal-on-audit assertions.
M-003 — Authorization gates (Phase B)
* Approval handler rejects self-approval / self-rejection with 403
when the actor principal equals the job's requested_by field.
* Bulk revocation is gated behind the 'admin' role; operators and
viewers receive 403.
* Regression coverage:
- internal/service/job_test.go — TestApproveJob_NotSelf,
TestRejectJob_NotSelf.
- internal/api/handler/bulk_revocation_handler_test.go —
TestBulkRevoke_RequiresAdmin, TestBulkRevoke_AdminSucceeds.
M-006 — RFC-compliant CRL/OCSP on the unauthenticated .well-known mux
* Per RFC 8615, relying parties cannot reasonably be asked to
authenticate against the issuing certctl instance to retrieve
revocation material. CRL and OCSP move off the authenticated
'/api/v1/crl*' and '/api/v1/ocsp/*' paths onto:
GET /.well-known/pki/crl/{issuer_id}
Content-Type: application/pkix-crl (RFC 5280 §5)
GET /.well-known/pki/ocsp/{issuer_id}/{serial}
Content-Type: application/ocsp-response (RFC 6960)
* Non-standard JSON CRL shape is removed; only DER is served.
* Short-lived certificate exemption (profile TTL < 1h → skip
CRL/OCSP) is preserved; the response simply omits the serial.
* Routes are registered on the unauthenticated 'finalHandler' mux
in cmd/server/main.go alongside EST ('/.well-known/est/*') and
SCEP ('/scep'). Legacy authenticated paths return 404.
* Regression coverage:
- internal/api/handler/certificate_handler_test.go — content
type, DER parseability, 404 for unknown issuer.
- internal/api/handler/adversarial_path_test.go — unauthenticated
access asserted for CRL, OCSP, EST, SCEP.
- internal/api/router/router_test.go — route-table assertion
that '.well-known/pki/*', '.well-known/est/*', and '/scep' are
mounted on the unauthenticated branch.
M-001 — Auto-closed by M-002
EST and SCEP were already registered on the unauthenticated
'finalHandler' mux; the router comment at
internal/api/router/router.go:247 now matches reality. The
adversarial-path tests above lock the behavior in.
Verification (all gates green):
* go vet ./... — clean
* go build ./... — ok
* go test -short ./... (55+ packages) — all pass
* web/ : npm test (225 Vitest tests) — all pass
* web/ : npx tsc --noEmit — clean
* grep sweep for '/api/v1/(crl|ocsp)' — 13 surviving hits,
all intentional M-006 tombstone/relocation comments.
Documentation:
* coverage-gap-audit.md — status flips M-001/M-002/M-003/M-006 →
Fixed, with per-finding resolution paragraphs citing regression
test IDs. (Audit file lives outside this repo; see cowork root.)
* CLAUDE.md Project Status line updated with the auth-unification
closure note.
* docs/features.md, docs/architecture.md, docs/quickstart.md,
docs/concepts.md, docs/connectors.md, docs/test-env.md,
docs/testing-guide.md, docs/compliance-*.md, docs/demo-advanced.md
— refreshed for the new '.well-known/pki/*' namespace and named
API keys.
* api/openapi.yaml — documents the new unauthenticated endpoints
and removes the legacy '/api/v1/crl*' + '/api/v1/ocsp/*' paths.
.gitignore: adds '/.gocache/' and '/.gomodcache/' for the session-
scoped Go caches so they never enter the tree.
- New deploy/ENVIRONMENTS.md: comprehensive walkthrough of all 4 compose
files with service-by-service explanations, beginner-friendly Docker
concepts, and expert-level networking/config details
- Fix docker-compose.dev.yml: agent LOG_LEVEL → CERTCTL_LOG_LEVEL (was
silently ignored without the CERTCTL_ prefix)
- Add CERTCTL_CONFIG_ENCRYPTION_KEY to base and test compose (enables
M34/M35 dynamic issuer/target config encryption)
- Add CERTCTL_DISCOVERY_DIRS to base compose agent (enables filesystem
certificate discovery in default deployment)
- Cross-link ENVIRONMENTS.md from README doc table and quickstart.md
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Create docs/examples.md as the central entry point for all 5 turnkey
docker-compose scenarios with a decision matrix, per-example summaries,
and contextual migration guide links. Update quickstart.md to bridge
from demo to real deployment. Consolidate README docs table (10 rows
from 13). Fix Vault PKI "(planned)" in cert-manager guide.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clarify that Docker Compose demo runs with auth disabled and
explain how to enable API key auth for production deployments.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three bugs fixed:
- Docker Compose only mounted migration 000001; migrations 000002-000007
(profiles, agent groups, revocation, discovery, network scans) never ran,
breaking half the demo features. Now mounts all 7 migrations in order.
- Network Scans page crashed with pq.Array scan error because lib/pq
doesn't support []int, only []int64. Changed Ports field accordingly.
- Dashboard pie chart displayed "RenewalInProgress" without spaces.
Added formatStatus() helper for PascalCase → spaced display.
Also adds first-run demo experience improvements:
- 9 discovered certificates (filesystem + network scan mix)
- 3 discovery scans with recent timestamps
- 2 AwaitingApproval renewal jobs for approval workflow demo
- CERTCTL_NETWORK_SCAN_ENABLED=true in Docker Compose
- Network scan targets seeded with last_scan results
- Version badge updated to v2.0.5
- Docs updated (quickstart, advanced demo) to reference seeded data
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three new GUI surfaces closing the backend-to-frontend gap for V2:
- Discovery triage page: summary stats bar, DataTable with claim/dismiss
actions, status/agent filters, collapsible scan history panel
- Network scan target management: CRUD with create modal, enable/disable
toggle, Scan Now button, last scan results display
- Jobs page approval workflow: Approve/Reject buttons for AwaitingApproval
jobs, rejection reason modal, pending approval banner with count,
AwaitingApproval/AwaitingCSR added to status filter dropdown
Also adds 13 new frontend tests, 4 API types, 12 API client functions,
2 sidebar nav items, 2 routes, and discovery status badge styles.
Docs updated: README, architecture, quickstart, demo-advanced, CLAUDE.md,
roadmap. Version bumped to v2.0.4.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Navigation menus for testing guide, architecture, concepts,
connectors, quickstart, advanced demo, and three compliance docs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement Enrollment over Secure Transport protocol with 4 endpoints under
/.well-known/est/ — cacerts (CA chain distribution), simpleenroll (initial
enrollment), simplereenroll (certificate renewal), and csrattrs (CSR
attributes). PKCS#7 certs-only wire format with hand-rolled ASN.1, accepts
both PEM and base64-encoded DER CSRs, configurable issuer and profile
binding, full audit trail. 28 new tests (18 handler + 10 service).
Also includes:
- GetCACertPEM added to issuer connector interface (all 4 issuers updated)
- EST integration tests wired into e2e test suite (13 test cases)
- QA testing guide Part 26 (15 manual EST test cases)
- All docs updated: README, features, architecture, concepts, connectors,
quickstart, demo-advanced (endpoint counts, MCP wording, agent IDs,
issuer interface, resource lists, OpenSSL status)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Consolidated two overlapping docs into one cohesive guide framed around
the 47-day certificate lifespan reduction. Covers setup, dashboard
walkthrough, API exploration, cert creation, discovery, CLI, MCP, demo
data reference, and a 10-step stakeholder presentation flow.
Removed demo-guide.md and updated all cross-references in README,
compliance-pci-dss.md, and testing-guide.md.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
M21 adds server-side active TLS scanning of CIDR ranges with concurrent
probing, sentinel agent pattern for pipeline reuse, and full CRUD API for
scan targets. M22 adds Prometheus exposition format endpoint alongside
existing JSON metrics. Comprehensive documentation audit updates all docs
to reflect 91 endpoints, 19 tables, 6 scheduler loops, and 900+ tests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add missing V2 concepts (Certificate Profiles, Revocation with CRL/OCSP,
Short-Lived Certificates, CLI, MCP Server, Observability) to concepts guide.
Update quickstart with revocation examples, sorting/filtering, cursor pagination,
sparse fields, stats/metrics, and approval workflows. Align 5-minute demo guide
and advanced demo to full V2 feature set including revocation workflows, bulk ops,
fleet overview, and dashboard charts. Update architecture with MCP server section,
5th scheduler loop, API audit log, and 860+ test count. Add revocation-across-issuers
section to connectors guide.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
README: lead with CA/Browser Forum Ballot SC-081v3 (47-day certs by 2029)
and certctl's end-to-end automation positioning. Update architecture
diagram and target lists to include Apache/HAProxy. Update roadmap
with new M15 (Revocation Infrastructure), renumbered M16-M18, and
V3.1 cert-manager/IAM Roles Anywhere additions.
concepts.md: rewrite "Why Do Certificates Expire?" with shrinking
lifespan timeline and automation imperative.
quickstart.md: add 47-day framing in intro.
architecture.md: add Apache/HAProxy to system diagram, target connector
diagram, deployment section, and ER diagram (agent metadata columns).
Update planned targets list for V3.1. Fix test count (230+).
connectors.md: fix notifier planned version reference (V2 not V2.1).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix demo certificate count: 14 → 15 across README, quickstart,
demo-guide (wildcard cert was added but count never updated)
- Fix negative_test subtest count: 12 → 14 in architecture.md
- Update README roadmap: v1.0.0 released (no longer "tag pending")
- Update status badge: "active development" → "v1.0.0"
- Remove stale POSTGRES_IMPLEMENTATION.md and POSTGRES_PATTERNS.md
(scaffold-era dev notes, not referenced anywhere)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without --build, Docker reuses cached images that don't include the
built frontend, resulting in a blank page. Every doc that tells users
to run docker compose up now includes --build.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- CLAUDE.md: check off frontend tests (53 Vitest tests done), update test count to 220+, update endpoint count to 55, update CI description
- README.md: add missing API endpoints (PUT/DELETE for issuers, targets, teams, owners, policies; POST notifications/{id}/read; auth endpoints), update endpoint count from 40+ to 55, update test count to 220+
- architecture.md: add frontend test layer description, update CI section with Vitest step, update dashboard description with action buttons (create cert modal, deploy, archive, test issuer, enable/disable policy, delete)
- demo-guide.md: fix incorrect /api/v1/policies/violations endpoint to /api/v1/policies/{id}/violations, update "Demo Without Docker" section from stale web/index.html to Vite dev server
- quickstart.md: fix auto-generated ID format from UUID to name-timestamp format
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Wire issuer connector end-to-end with IssuerConnectorAdapter (dependency inversion)
- Renewal/issuance job processor: RSA key + CSR generation, Local CA signing, cert version storage
- Agent work API (GET /agents/{id}/work) and job status API (POST /agents/{id}/jobs/{job_id}/status)
- Agent-side deployment: WorkItem enrichment with target type/config, NGINX/F5/IIS connector invocation
- Full ACME v2 implementation: HTTP-01 challenge solving, account registration, order lifecycle
- Update all docs (README, architecture, connectors, demo-advanced, quickstart) for M1-M2
- Fix go vet warning in deployment.go (non-constant format string)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Runtime fixes:
- Fix env var mismatch (CERTCTL_DB_URL → CERTCTL_DATABASE_URL)
- Fix table name mismatches (certificates → managed_certificates, notifications → notification_events)
- Add renewal_policy_id to certificate queries
- Remove non-existent created_at from notification queries
- Add env var fallback for agent CLI flags
- Graceful degradation for missing notifiers/issuers in demo mode
- Copy web/ directory in Dockerfile for dashboard serving
Service layer:
- Implement handler-service interface pattern across all services
- Wire up certificate, agent, job, policy, team, owner, audit, notification services
Documentation:
- Add concepts.md: beginner-friendly guide to TLS, CAs, private keys
- Rewrite quickstart.md with accurate API examples matching actual handlers
- Add demo-advanced.md: interactive demo with cert issuance and automated script
- Update architecture.md with correct table names and connector interfaces
- Update connectors.md to match actual Go interface signatures
- Update demo-guide.md with cross-references to new docs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>