9 Commits

Author SHA1 Message Date
shankar0123 8191b1ee64 scheduler+db: close Phase 6 — scale hardening across pool, jitter, ETag, asyncpoll
Phase 6 of the certctl architecture diligence remediation. Five
findings across the same scheduler-and-DB-pool surface.

SCALE-M1 (Med) — DB pool default bumped 25 → 50
  internal/config/config.go line 1972:
    MaxConnections: getEnvInt("CERTCTL_DATABASE_MAX_CONNS", 50)
  Postgres default max_connections is 100; 50 leaves headroom for
  pg_dump + ad-hoc psql + a server replica without exhausting the
  DB-side cap. Operator override env var unchanged. Operator-tune
  ladder for larger fleets (5K / 50K certs) lives in
  docs/operator/scale.md as starter values pending Phase 8 load
  tests — explicitly marked TBD.

SCALE-M3 (Med) — async-CA poll budget operator-configurable
  Live state was partially-already-shipped: all 4 async-CA
  connectors (digicert, entrust, globalsign, sectigo) already have
  per-connector CERTCTL_<NAME>_POLL_MAX_WAIT_SECONDS (Audit fix #5
  closed pre-Phase-6). What was missing: a global package-default
  override. Shipped:
    - internal/connector/issuer/asyncpoll/asyncpoll.go gains
      SetDefaultMaxWait(d) + effectiveDefaultMaxWait var + the
      currentDefaultMaxWait() priority resolver.
    - cmd/server/main.go reads CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS
      at boot and calls SetDefaultMaxWait.
    - deploy/ENVIRONMENTS.md documents the new env var (G-3 guard
      green).
  Naming deviation from the prompt's CERTCTL_ASYNC_POLL_MAX_ATTEMPTS:
  the live code tracks wall-clock time (MaxWait), not attempt count.
  Matched the existing per-connector nomenclature (_POLL_MAX_WAIT_SECONDS)
  so the priority chain reads naturally.

SCALE-M5 (Med) — JitteredTicker wrapper for all 15 scheduler loops
  internal/scheduler/jitter.go ships NewJitteredTicker(interval,
  jitterPct) + DefaultSchedulerJitter (±10%). All 15 sites in
  internal/scheduler/scheduler.go migrated from bare time.NewTicker
  to NewJitteredTicker(interval, DefaultSchedulerJitter). Base
  intervals unchanged; only the per-tick envelope adds ±10%
  randomized delay so multiple loops with the same nominal cadence
  don't co-fire and spike CPU + DB at wall-clock boundaries.

  internal/scheduler/jitter_test.go pins:
    - Bounded envelope (each tick within ±jitterPct of interval)
    - Mean drift < 30% of nominal (sign-bug detector)
    - Stop() releases the goroutine + closes C
    - Stop() idempotent (no panic on repeat)
    - Zero-jitter behaves like time.NewTicker
    - Negative and >=1 jitterPct values clamped defensively

  CI guard scripts/ci-guards/no-bare-newticker-in-scheduler.sh blocks
  any future bare time.NewTicker in scheduler.go.

SCALE-L1 (Low) — renewal-sweep semaphore behavior documented
  docs/operator/scale.md "Scheduler tick budgets" section explains
  the per-tick concurrency semaphore (CERTCTL_RENEWAL_CONCURRENCY=25
  default), the ctx-cancellation drain on tick-budget overrun, and
  operator tuning advice (raise concurrency + DB pool together).
  No code change — the behavior is defensible as-is per the audit.

SCALE-L2 (Low) — ETag middleware for top-5 read endpoints
  internal/api/middleware/etag.go computes SHA-256 ETag over the
  buffered response body, respects If-None-Match, short-circuits
  to 304 Not Modified on match. GET/HEAD only; non-2xx responses
  pass through unchanged. 64 KiB buffer cap degrades gracefully on
  oversized responses (no caching, body still flushes intact).

  Wired around the top-5 read endpoints via etagged() helper in
  internal/api/router/router.go:
    GET /api/v1/certificates
    GET /api/v1/agents
    GET /api/v1/jobs
    GET /api/v1/audit
    GET /api/v1/discovered-certificates

  internal/api/middleware/etag_test.go pins 11 behaviors including
  304-on-repeat, 200-after-mutation-with-new-ETag, POST bypass,
  4xx/5xx pass-through, oversized-response degradation, wildcard
  match, HEAD-treated-like-GET, byte-equal pass-through.

Cross-cutting fixes:
  - internal/config/config_test.go::TestLoad_DefaultValues updated
    to assert the new 50 default (was 25).
  - deploy/helm/certctl/values.yaml comment corrected — agent
    pollInterval is hardcoded 30s, not env-configurable; the
    Phase 4 comment mistakenly referenced CERTCTL_AGENT_POLL_INTERVAL
    which G-3 caught as a phantom env var.
  - asyncpoll.go reformatted by gofmt; functionally unchanged.

Verification (all pass):
  grep -nE 'SetMaxOpenConns' internal/repository/postgres/db.go    # finds 1 site
  grep -nE 'CERTCTL_DATABASE_MAX_CONNS.*50' internal/config/config.go  # config default is 50
  grep -rnE 'CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS' internal/ deploy/ENVIRONMENTS.md  # wired
  grep -cE 'time\.NewTicker\(' internal/scheduler/scheduler.go    # 0 (all migrated)
  grep -cE 'JitteredTicker' internal/scheduler/scheduler.go         # 15
  ls internal/scheduler/jitter.go internal/api/middleware/etag.go   # both exist
  ls docs/operator/scale.md                                          # exists
  bash scripts/ci-guards/no-bare-newticker-in-scheduler.sh          # clean
  bash scripts/ci-guards/G-3-env-docs-drift.sh                      # clean
  go test ./internal/scheduler/ ./internal/api/middleware/ \
    ./internal/connector/issuer/asyncpoll/ ./internal/config/       # 4/4 packages green

Closes: cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M1
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M3
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M5
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L1
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L2
2026-05-14 01:23:03 +00:00
shankar0123 69a2b5c55a config: default hardening + operator docs (Phase 2 closure — SEC-H1, SEC-H3, SEC-M4, DEPL-H1, DEPL-M2 + doc-only carve-outs)
Eleven findings from the architecture diligence audit's Phase 2 bundle
closed in one PR. All touch the same backend config + Helm chart +
operator docs surface, so reviewing in one diff is the natural fit.

config.go: three new fail-closed Validate() branches behind sentinels
=====================================================================

Three new error sentinels exported from internal/config/config.go for
tests to pin via errors.Is + message-text:
  - ErrAgentBootstrapTokenRequired (SEC-H1)
  - ErrACMEInsecureWithoutAck      (SEC-M4)
  - ErrDemoModeAckExpired          (SEC-H3)

SEC-H1 (staged): introduces CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY
as an opt-in feature flag. When true AND the bootstrap token is empty,
Validate() returns ErrAgentBootstrapTokenRequired and the server
refuses to start. Default in THIS release: false (warn-mode
pass-through preserved). WORKSPACE-ROADMAP.md schedules the default
flip to true for v2.2.0 — operators get one upgrade window.

SEC-M4: upgrades the existing boot-time WARN log for
CERTCTL_ACME_INSECURE=true into a hard refuse-to-start gate behind
CERTCTL_ACME_INSECURE_ACK=true. The ACK env var must be paired with
the existing INSECURE flag; either alone fails closed. The boot-time
WARN log at cmd/server/main.go:611 continues to fire for the ACK'd
case so every restart logs the reminder.

SEC-H3: tightens the sticky DemoModeAck bit so it expires after 24h.
When DemoModeAck=true, Validate() now requires CERTCTL_DEMO_MODE_ACK_TS
to be set as a unix-epoch timestamp within the last 24h (24h-tolerance
on the past side, 1-minute clock-skew on the future side). Catches the
"forgotten demo deployment promoted to production" failure mode —
next container restart past 24h refuses unless re-ack'd.

Tests in internal/config/config_test.go cover every new branch:
positive (passes when properly set), negative (each fail-closed path
fires with the matching sentinel + message-text). 11 new tests added.

Helm chart + HA runbook (DEPL-H1)
=================================

Created docs/operator/runbooks/ha.md documenting the three values
flips required for production HA: server.replicas, podDisruptionBudget,
service.sessionAffinity. Cross-link comments added to
deploy/helm/certctl/values.yaml next to the server.replicas (line 19)
and podDisruptionBudget (line 566) defaults. DEFAULTS DO NOT CHANGE
— that's the point per the prompt's 'do not flip networkPolicy default'
guidance: a default-enabled PDB blocks fresh helm install on
single-node clusters.

CI guard (DEPL-M2)
==================

scripts/ci-guards/no-change-me-in-prod-compose.sh grep-fails any
'change-me-' literal in compose files OTHER than docker-compose.demo.yml.
Catches the placeholder-credential-leak regression one layer earlier
than the runtime Validate() fail-closed guards from Bundle 2 (2026-05-12).
Excludes comment lines so docs explaining the pattern don't trip the
guard. Verified to fire on a synthetic leak; clean on the current tree.

Consolidated 'Security carve-outs' doc section
==============================================

docs/operator/security.md grows by one new section documenting the
seven existing carve-outs in one canonical place:
  - SEC-M3: 3 InsecureSkipVerify=true sites (Agent dev, verify probe, tlsprobe)
  - SEC-M5: F5 connector InsecureSkipVerify per-config field
  - SEC-M4: ACME insecure + new ACK gate
  - SEC-L1: CSP 'unsafe-inline' on style-src (Tailwind carve-out)
  - SEC-L2: break-glass Argon2id rest-defense reminder
  - SEC-L3: 1 MB body-size cap + CERTCTL_MAX_BODY_SIZE override
  - DEPL-M2: change-me-* placeholder credentials in demo overlay
  - DEPL-M3: K8s NetworkPolicy operator-opt-in default

Each entry cites the file:line, the rationale for the carve-out, and
the operator action.

CHANGELOG + ENVIRONMENTS coverage
==================================

CHANGELOG.md grows by one new '### Breaking changes (scheduled for
v2.2.0)' section under Unreleased, documenting SEC-H1 / SEC-M4 / SEC-H3
with explicit upgrade-window guidance for each.

deploy/ENVIRONMENTS.md adds five rows: AGENT_BOOTSTRAP_TOKEN +
AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY + DEMO_MODE_ACK + DEMO_MODE_ACK_TS +
ACME_INSECURE_ACK. G-3 env-docs-drift CI guard stays clean.

WORKSPACE-ROADMAP.md (cowork-side) schedules the SEC-H1 default-flip
for v2.2.0.

Sandbox limitation
==================

The certctl repo's working tree is 6.1 GB which fills the sandbox
volume; the go1.25.10 toolchain download (go.mod requires it,
sandbox has 1.25.9) keeps failing on disk-full. Local 'go build' /
'go test' were NOT run in this commit's verification path.
make verify MUST be run on the operator's workstation before push
per CLAUDE.md operating rules.

CI guards (no-change-me, G-3 env-docs-drift, doc-rot-detector, +
all existing) verified clean by running each individually.

Closes: cowork/certctl-architecture-diligence-audit.html#fix-SEC-H1,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-H3,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-M4,
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H1,
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M2,
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M3,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-M3,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-M5,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-L1,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-L2,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-L3
2026-05-13 19:50:00 +00:00
shankar0123 0161bb201c docs: remove internal engineering docs; docs must be tool- or story-relevant
Operator policy: docs in the public repo must help (a) a user
deploying certctl or (b) the product story. Internal engineering
process documentation belongs in cowork/ scratchpads or in git
commit history, not docs/.

Removed (docs/contributor/, 8 files, 2,323 lines):
  - release-sign-off.md         — internal release-day checklist
  - ci-pipeline.md              — what runs in CI (internal)
  - ci-guards.md                — what the guards are (internal)
  - testing-strategy.md         — internal testing strategy
  - qa-test-suite.md            — internal QA reference (445 lines)
  - qa-prerequisites.md         — internal QA setup
  - gui-qa-checklist.md         — manual GUI QA checklist
  - test-environment.md         — 1,103-line redundant with
                                  docs/getting-started/quickstart.md +
                                  docs/getting-started/advanced-demo.md

Removed supporting script:
  - scripts/qa-doc-seed-count.sh — CI guard for the deleted
                                   qa-test-suite.md seed-data table

Cross-reference cleanup:
  - README.md: dropped the Contributor audience row + footer
    pointer to docs/contributor/.
  - Makefile: dropped `verify-docs` target + qa-stats comment refs.
  - .github/workflows/ci.yml: dropped the QA-doc seed-count drift
    CI step + dead comment refs.
  - docs/reference/cli.md: repointed qa-prerequisites.md → quickstart.md.
  - docs/operator/performance-baselines.md: dropped ci-pipeline.md
    cross-ref.
  - scripts/ci-guards/README.md: dropped the 'Guards explicitly
    NOT here' section that referenced the deleted QA-doc guards.

G-3 env-docs-drift guard improvements (a real consequence: deleting
the contributor docs surfaced that some env vars only had a home
there). Refit the guard to the new doc topology:
  - Defined-scan widened from `config.go + cmd/*` to all of `cmd/ +
    internal/` (production code), excluding `*_test.go` — catches
    service-layer env vars like CERTCTL_STEPCA_ROOT_CERT and
    CERTCTL_ZEROSSL_EAB_URL that were previously invisible to the
    guard.
  - Docs-scan widened to include deploy/ENVIRONMENTS.md (the
    canonical env-var inventory table — should have been in scope
    from day one). Kept narrow to README + docs/ + deploy/helm/ +
    ENVIRONMENTS.md to avoid pulling in compose/test fixtures.
  - ALLOWED filter now applies to both DOCS_ONLY and CONFIG_ONLY
    directions, so dynamic per-profile dispatch surfaces
    (CERTCTL_SCEP_PROFILE_<NAME>_*, CERTCTL_EST_PROFILE_<NAME>_*,
    CERTCTL_QA_*) don't need static doc entries.
  - Added CERTCTL_SCEP_PROFILE_[A-Z_]+ and CERTCTL_EST_PROFILE_[A-Z_]+
    to ALLOWED for the same reason.

deploy/ENVIRONMENTS.md: added CERTCTL_ZEROSSL_EAB_URL row — real
operator override (overrides the ZeroSSL EAB-credentials endpoint;
read at internal/connector/issuer/acme/acme.go:372) that was
defined in Go source but never documented. G-3 caught it after the
defined-scan widened.

scripts/ci-guards/S-1-hardcoded-source-counts.sh: removed dead
WORKSPACE-CHANGELOG.md allowlist entry (the file was deleted in
the prior workspace cleanup).

Verified:
  All 35 scripts/ci-guards/*.sh green (FAIL=0).
  No remaining references to docs/contributor/ or qa-doc-seed-count
  in tracked files.
2026-05-13 02:44:27 +00:00
shankar0123 a849c8b8cf fix(security): close BUNDLE 2 — safe first run, demo mode, agent bootstrap
Bundle 2 closure (2026-05-12 acquisition diligence audit). Closes the
"docker compose up == accidental production" hazard: pre-Bundle-2 the
base deploy/docker-compose.yml WAS the demo path (AUTH_TYPE=none +
DEMO_MODE_ACK=true + KEYGEN_MODE=server + DEMO_SEED=true + literal
change-me-... placeholder creds), the README claimed "drop the demo
overlay for a clean install", and ENVIRONMENTS.md table documented
auth-type default as api-key — three contradictory stories layered on
the same compose file.

Source findings closed:
  R2 R3 C1 D9 finding-2 S9               (repo audit)
  SEC-H2 SEC-M1 SEC-M3 OPS-M3 LOW-5 HIGH-6 (cowork audit)

Compose split (deploy/docker-compose.yml + deploy/docker-compose.demo.yml):
The base now ships production-shaped — no AUTH_TYPE override, no
KEYGEN_MODE override, no DEMO_MODE_ACK, no DEMO_SEED, no literal
placeholder fallbacks. POSTGRES_PASSWORD / CERTCTL_AUTH_SECRET /
CERTCTL_CONFIG_ENCRYPTION_KEY / CERTCTL_API_KEY / CERTCTL_AGENT_ID
must come from deploy/.env (sample template in deploy/.env.example +
root .env.example). The demo overlay carries the full demo posture
(every env var + every placeholder credential) so the
`-f docker-compose.demo.yml` one-flag flip remains a zero-config
populated-dashboard path.

Fail-closed startup guards (internal/config/config.go::Validate):
Three new gates layered on the existing HIGH-12 demo-mode listen-bind
guard. All three exempt CERTCTL_DEMO_MODE_ACK=true so the demo overlay
keeps working:
  • HIGH-6:  AUTH_SECRET = "change-me-in-production"        → refuse
  • HIGH-6:  CONFIG_ENCRYPTION_KEY = "change-me-32-char..." → refuse
  • LOW-5:   CORS_ORIGINS contains "*"  (CWE-942 + CWE-352) → refuse

Visible DEMO MODE banner (cmd/server/main.go): every boot under
DEMO_MODE_ACK=true now emits a prominent WARN line with a 6-step
production-promotion checklist. The 2026-04-19 incident (a screenshot
run that kept running for three days) drove this; the per-startup
banner makes the posture unmissable in any log scraper.

Agent enrollment doc alignment:
  • docs/reference/configuration.md L83: corrected the non-existent
    URL `POST /api/v1/agents/register` to the real route
    `POST /api/v1/agents`; added the bootstrap-token note and the
    install-agent.sh handoff sequence.
  • docs/reference/architecture.md L154: replaced "agents register
    themselves at first heartbeat" (false — cmd/agent/main.go fail-
    fasts when CERTCTL_AGENT_ID is unset) with the actual two-step
    operator-driven flow (REST or GUI registration first, returned ID
    fed to install-agent.sh second).

Tests + CI guard:
  • 9 new TestValidate_Bundle2_* cases in internal/config/config_test.go
    covering: placeholder-secret refused + demo-ack exempt; placeholder
    encryption-key refused + demo-ack exempt; real key not mistaken for
    placeholder; wildcard CORS refused + demo-ack exempt; wildcard mixed
    into a concrete allowlist still refused; concrete allowlist accepted.
  • scripts/ci-guards/B2-compose-base-no-demo-env.sh: greps the base
    compose for any of the demo-mode env vars + placeholder credentials.
    Comments stripped before checking so the narrative header in the
    base file can still reference the overlay's posture in prose.

Cold-DB CI smoke (.github/workflows/ci.yml::cold-db-compose-smoke):
Switched to layering -f docker-compose.demo.yml on top of the base —
the new production base requires real env vars the smoke doesn't have,
and the smoke's purpose (catch migration-on-cold-DB regressions + the
bootstrap-token mint path) is orthogonal to which auth posture the
boot lands in.

Receipts:
  • Current first-run truth table
        compose flag                                  → posture
        -f docker-compose.yml                          (production)
                                                       → requires .env;
                                                       fail-fasts on
                                                       missing AUTH_SECRET
                                                       / CONFIG_ENCRYPTION
                                                       _KEY / POSTGRES
                                                       _PASSWORD; agent
                                                       fail-fasts on
                                                       missing AGENT_ID
        -f docker-compose.yml -f docker-compose.demo.yml  (demo)
                                                       → zero-config;
                                                       AUTH_TYPE=none +
                                                       DEMO_MODE_ACK=true
                                                       + KEYGEN=server +
                                                       DEMO_SEED=true;
                                                       boot banner WARN
        -f docker-compose.yml -f docker-compose.dev.yml   (dev)
                                                       → base + PgAdmin
                                                       + debug logging
        -f docker-compose.test.yml                     (test, standalone)
                                                       → production-shape
                                                       posture, real CA
                                                       backends
  • Verification (PATH=/tmp/go/bin export GO* paths to /tmp):
        gofmt -l                                      # clean (no diffs)
        go vet ./internal/config ./cmd/server         # clean
        go test -short -count=1 ./internal/config/... # PASS (cumulative +
                                                       all 9 new Bundle 2
                                                       cases green)
        go test -short -count=1                       # PASS (no regression
            ./internal/connector/target/configcheck    in the Bundle 1 -
                                                       closure tests)
        go build ./cmd/server ./cmd/agent             # clean
            ./cmd/cli ./cmd/mcp-server
        bash scripts/ci-guards/B2-compose-base-no-demo-env.sh  # clean
        bash scripts/ci-guards/H-1-encryption-key-min-length.sh # clean
        bash scripts/ci-guards/G-3-env-docs-drift.sh           # clean

Remaining operator warnings (not blocking; tracked in CLAUDE.md
"Open decisions"):
  • The first `docker compose -f docker-compose.yml up -d` against a
    pre-Bundle-2 .env (placeholder values still in place) will now
    fail-fast. This is the intended posture but operators upgrading
    from v2.0.x via .env-from-old-master need to rotate before
    upgrading. The CHANGELOG note for the v2.1.0 release should
    call this out alongside Auth Bundle 2's other breaking changes.

Audit-Closes: BUNDLE-2 R2 R3 C1 D9 S9 SEC-H2 SEC-M1 SEC-M3 OPS-M3 LOW-5 HIGH-6
2026-05-13 00:14:59 +00:00
shankar0123 1720e11109 docs: fix broken single-file demo invocation in README + qa-prerequisites + ENVIRONMENTS
The README's Quick Start, the qa-prerequisites contributor doc, and the
landing page (separate repo, separate commit) all shipped a copy-paste
command that produces:

  service "certctl-server" has neither an image nor a build context
  specified: invalid compose project

The bug landed silently with commit a3d8b9c (the U-3 master). Pre-U-3,
docker-compose.demo.yml was self-contained and could be invoked with a
single -f flag. U-3 deliberately reduced it to a 27-line overlay — its
only payload today is `CERTCTL_DEMO_SEED=true` on the certctl-server
service — because the demo seed now applies at boot via
postgres.RunDemoSeed, not via /docker-entrypoint-initdb.d/. The overlay
no longer carries an image: or build: of its own, so it MUST be passed
alongside the base file.

The README/qa-doc/landing-page never picked up the rename of the contract.
Every operator who copy-pasted the Quick Start since U-3 has hit the
"invalid compose project" error and bounced. The operator caught it
running the demo locally today.

This commit fixes the three certctl-repo sites:

  README.md (Quick Start)
    docker compose -f deploy/docker-compose.demo.yml up -d --build
    →
    docker compose -f deploy/docker-compose.yml -f deploy/docker-compose.demo.yml up -d --build

    Plus the "drop the -f flag for clean install" prose now spells out
    the correct fallback (`-f deploy/docker-compose.yml` alone).

  docs/contributor/qa-prerequisites.md (Step 1)
    Same single-file → two-file fix, plus an inline note explaining
    why the override-only file requires the base (so the next person
    who reads it understands the contract instead of re-discovering it).

  deploy/ENVIRONMENTS.md (Demo Overlay → What it adds)
    Replaced the stale "One line: mounts seed_demo.sql into PostgreSQL's
    init directory" claim — that hasn't been true since U-3 — with the
    accurate "One env var: CERTCTL_DEMO_SEED=true; server applies
    seed_demo.sql at boot via postgres.RunDemoSeed" description, plus
    the historical context for why the overlay can't stand alone.

The certctl.io landing page hits the same bug (line 759); fix shipping
in a separate commit in that repo.

Acceptance gate (manual):
  - copy/paste the new README Quick Start command end-to-end against
    a fresh clone — succeeds, dashboard at https://localhost:8443
    shows the seeded demo data within ~30s.
  - clean-install fallback (`docker compose -f deploy/docker-compose.yml
    up -d --build`) starts a working stack with no demo data.
2026-05-05 20:55:26 +00:00
shankar0123 0729ee46e0 chore: sweep github.com/shankar0123/certctl URL refs to certctl-io/certctl
Post-transfer cosmetic + release-critical URL refresh after moving the
repo from github.com/shankar0123/certctl to github.com/certctl-io/certctl
(2026-05-03). GitHub HTTP redirects continue to forward old URLs forever,
so existing operators are not broken — but aligns the canonical
references with the new owner so:

- procurement engineers / contributors browsing the docs see the right
  URL on first read
- operators copying the agent install one-liner hit the new path
  directly without going through a redirect
- the Helm chart's default image repository points at the canonical org
  registry path
- the OnboardingWizard rendered to first-run UI users shows the new
  URL in the install snippets and doc anchor links
- the GitHub Actions release workflow pushes container images to
  ghcr.io/certctl-io/certctl-{server,agent} (was: shankar0123)
- the release-notes Markdown body in release.yml — which gets stamped
  into every future release page — references the post-transfer
  cert-identity (cosign keyless signing now uses the certctl-io
  workflow URL) and the post-transfer SLSA provenance source-uri.
  Without this, every cosign verify / slsa-verifier command on a
  v2.1.0+ release would fail because the cert-identity-regexp would
  not match the signing identity GitHub Actions OIDC issues post-
  transfer. Old releases (v2.0.67 and earlier) keep their immutable
  release-notes pointing at the shankar0123 path and remain
  verifiable via their own published instructions.

Customer impact:
- Operators on ghcr.io/shankar0123/certctl-{server,agent}:latest
  silently freeze on whatever tag was current at transfer time. They
  get no errors; they just stop receiving updates. The next release
  notes need a one-line callout (Phase 3.1 of cowork/transfer-
  certctl-to-org.md) telling them to update their image path to
  ghcr.io/certctl-io/certctl-{server,agent}.
- All other URLs (git clone, install one-liner, raw.githubusercontent
  URLs, browser links, GitHub API) continue to resolve via permanent
  HTTP redirects. The sweep is cosmetic for those.

Files swept (30 total):
  .github/workflows/release.yml — IMAGE_NAMESPACE, source-uri,
    cosign cert-identity-regexp, IMAGE= snippet (5 refs total).
  CHANGELOG.md, README.md — anchor links, badges, install one-liner,
    cosign verify snippets in operator-facing sections.
  api/openapi.yaml — info / externalDocs URLs.
  install-agent.sh — GITHUB_REPO const + systemd unit Documentation=
    field.
  deploy/ENVIRONMENTS.md, deploy/helm/{CHART_SUMMARY,INDEX,
    INSTALLATION,README}.md, deploy/helm/certctl/{Chart.yaml,
    README.md,values.yaml}, deploy/helm/examples/values-*.yaml —
    chart docs + image repository defaults across dev / prod-ha
    overrides.
  docs/{certctl-for-cert-manager-users,connector-iis,connectors,
    migrate-from-acmesh,migrate-from-certbot,quickstart,test-env,
    why-certctl}.md — operator-facing doc URLs.
  examples/{acme-nginx,acme-wildcard-dns01,multi-issuer,
    private-ca-traefik,step-ca-haproxy}/docker-compose.yml +
    examples/step-ca-haproxy/step-ca-haproxy.md — example image:
    paths and accompanying narrative.
  web/src/pages/OnboardingWizard.tsx — first-run-UI URL refs (curl
    install one-liners, agent docker image path, doc anchor links).

Files intentionally NOT swept (Choice A from cowork/transfer-certctl-
to-org.md):
  go.mod, go.sum — module declaration stays github.com/shankar0123/
    certctl. Existing imports compile because Go uses the path
    declared in go.mod, not the URL it was fetched from. Internal-
    only project; no external Go consumers; rename will land as a
    mechanical sed when one materializes.
  ~250 *.go files — every import remains github.com/shankar0123/
    certctl/internal/...
  deploy/test/f5-mock-icontrol/go.mod — separate test sub-module;
    same Choice A logic; module path stays.

Files intentionally NOT swept (other reasons):
  README.md lines 244-245 — Scarf-pixel docker-pull commands.
    shankar0123.docker.scarf.sh/... is a Scarf-account hostname
    (per-user, not per-repo) and the pixel keeps tracking pulls
    against the operator's personal Scarf account. Migrating to a
    certctl-io Scarf account is a separate decision (create org
    Scarf account → re-create package → update README).
  deploy/test/f5-mock-icontrol/f5-mock-icontrol — checked-in
    compiled binary with shankar0123/certctl baked into Go build
    info via the sub-module path. Out of scope for a URL sweep;
    will refresh on the next `make test-integration` rebuild.

Verification:
  gofmt: clean (no .go files touched).
  go vet ./...: clean (verified at this SHA in 1.3 of the transfer
    checklist; no .go changes since).
  go build ./...: clean (same).
  go test -short on representative packages: green (same).
  Diff shape: 30 files, 74 insertions / 74 deletions, net-zero size,
    pure URL substitution.
2026-05-03 23:39:50 +00:00
shankar0123 a91197014f fix(db): emit volume-state guidance on postgres auth failure (U-1, #10)
The shipped quickstart instructs operators to copy deploy/.env.example to
deploy/.env, edit POSTGRES_PASSWORD, and run docker compose up. On the
*first* boot of a fresh checkout this works. On the *second* boot — i.e.,
when an operator first booted with the default POSTGRES_PASSWORD=certctl,
then edited .env and re-ran up — the certctl-server container picks up the
new password (env interpolated at every container start) but postgres does
not. The postgres docker-entrypoint runs initdb only when the data dir is
empty; on subsequent boots the persistent named volume postgres_data is
non-empty so pg_authid retains the password baked in on first boot. The
server connects with the new credentials, postgres rejects them, and the
operator sees an opaque `pq: password authentication failed for user
"certctl"` in the server log with no pointer to the actual cause. New-
operator onboarding gets blocked on the documented production path.

Why a doc fix alone is not sufficient. Operators don't reread the docs
after a successful first boot — the trap fires on the *second* up, when
they think they've already learned the system. The opaque pq error is
indistinguishable in the log from a typo'd password or a misconfigured
secret store. The diagnostic has to fire at the moment the failure is
observed.

Why we don't try to fix the bootstrap. The env-vs-pg_authid divergence is
intrinsic to how the official postgres image bootstraps (see
docker-entrypoint.sh: initdb runs only if PGDATA is empty). Switching to a
bind mount or ephemeral volume breaks the production path; switching to
POSTGRES_PASSWORD_FILE + ALTER ROLE adds operator surface without
eliminating the divergence. The ergonomic fix is to surface the failure
mode loudly, with both remediation paths, at the exact log line where it
becomes visible.

Two remediation paths, surfaced together. Destructive: `docker compose
-f deploy/docker-compose.yml down -v && up -d --build` — wipes the
postgres volume so initdb re-runs with the new env value. Use this on
demos / first-time setup where data loss is acceptable. Non-destructive:
`docker compose exec postgres psql -U certctl -c "ALTER ROLE certctl
PASSWORD '<new>';"` followed by a server restart with the matching
POSTGRES_PASSWORD. Use this on any environment that holds data you want
to keep. Surfacing both means the operator can pick based on their
environment without us assuming.

Files changed:

- internal/repository/postgres/db.go — extract wrapPingError(err) helper.
  errors.As against *pq.Error; on SQLSTATE 28P01 (invalid_password) emit
  the multi-line guidance preserving the %w wrap chain. Non-28P01 errors
  retain the original `failed to ping database: %w` shape so transient
  connection-refused / timeout paths don't get noisy. Add
  pgErrInvalidPassword = "28P01" constant. Convert blank
  `_ "github.com/lib/pq"` import to direct import (driver registration
  still works via init()) so we can name the *pq.Error type at compile
  time. NewDB now calls wrapPingError(err) instead of inlining the wrap.
- internal/repository/postgres/db_test.go (new) — 4 internal-package
  unit tests covering wrapPingError. AuthFailureGuidance pins the
  contract substrings ("SQLSTATE 28P01", "POSTGRES_PASSWORD",
  "first boot", "down -v", "ALTER ROLE"). NonAuthErrorPreservesOriginalWrap
  pins the no-leak contract for SQLSTATE 08006 (connection_failure).
  NonPqErrorPreservesOriginalWrap pins the network-level path.
  NilReturnsNil pins defensive contract. All run in -short without
  testcontainers — package postgres (internal) so the unexported helper
  is callable directly.
- docs/quickstart.md — `> **Warning:**` callout immediately after the
  `cp deploy/.env.example deploy/.env` block at lines 56-61. Names the
  trap, names the SQLSTATE, gives both remediation paths. Uses the
  in-file `> **Note:**` blockquote convention.
- deploy/ENVIRONMENTS.md — `**Stateful volume — first-boot password
  binding (U-1)**` paragraph appended to the Postgres expert-note block.
  Explains the env-vs-pg_authid divergence, points at wrapPingError as
  the runtime diagnostic, lists both remediation paths. Uses the in-file
  `**Expert note:**` convention.

Out of scope (separate follow-ups):

- deploy/helm/certctl/templates/postgres-statefulset.yaml has the same
  root cause via PVC retention. The wrapPingError diagnostic covers the
  Helm path because the same NewDB code runs at server startup; the
  Helm-specific doc warning lands separately.
- /.env.example at repo root (line 16 hardcodes the password literally
  inside CERTCTL_DATABASE_URL rather than interpolating) — adjacent
  trap, separate fix.
- examples/{acme-nginx,private-ca-traefik,step-ca-haproxy,multi-issuer,
  acme-wildcard-dns01}/docker-compose.yml all carry the pattern. The
  diagnostic covers them; targeted doc warnings are scoped to the
  canonical quickstart + ENVIRONMENTS docs.

Out of consideration:

- Switch to bind mount / ephemeral volume — breaks the production path.
- POSTGRES_PASSWORD_FILE + Docker secret + ALTER ROLE rotation — adds
  operator surface without fixing the env-vs-pg_authid divergence.

Verification (all passing):
- go build ./...
- go vet ./...
- go test -short -race ./internal/repository/postgres/ — 4/4 new tests
  pass plus existing tests
- go test -short ./... — every package green
- govulncheck ./... — no vulnerabilities in our code
- wrapPingError coverage 100%; postgres pkg total unchanged in shape
  (NewDB/RunMigrations were 0% pre-fix, still 0% post-fix; new helper
  adds 100%-covered statements)

Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
      §2 P1 cluster, cat-u-quickstart_postgres_password_volume_trap
      GitHub Issue #10 (mikeakasully)
2026-04-24 23:21:26 +00:00
shankar0123 52248be717 v2.0.47: HTTPS Everywhere — TLS-only control plane, agents/CLI/MCP
Breaking change release. Plaintext HTTP listener removed. The certctl
control plane now terminates TLS 1.3 on :8443 via
http.Server.ListenAndServeTLS. No CERTCTL_TLS_ENABLED=false escape
hatch. No dual-listener mode. One-step cutover per docs/upgrade-to-tls.md.

Server
- cmd/server/tls.go: certHolder with SIGHUP hot-reload + atomic cert
  swap, buildServerTLSConfig (TLS 1.3 min, GetCertificate callback),
  preflightServerTLS validation
- cmd/server/main.go: ListenAndServeTLS in place of ListenAndServe,
  watchSIGHUP wiring, cert/key path config threading
- tls_test.go: 418-line regression coverage of reload, preflight,
  callback behavior, SAN validation

Config
- CERTCTL_TLS_CERT_PATH / CERTCTL_TLS_KEY_PATH (required)
- Plaintext rejection: agents/CLI/MCP pre-flight-fail on http://
  URLs with a pointer to docs/upgrade-to-tls.md

Agents, CLI, MCP
- All three pre-flight-reject http:// URLs with fail-loud diagnostic
- CERTCTL_SERVER_CA_BUNDLE_PATH for private-CA trust
- CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY for dev-only bypass
  (loud warning on startup)
- install-agent.sh emits both vars as commented template lines

docker-compose
- certctl-tls-init sidecar generates SAN-valid self-signed cert into
  deploy/test/certs/ on first boot
- All demo-stack curls pin against ca.crt with --cacert

Helm chart
- Three TLS provisioning modes, exactly one required:
  - server.tls.existingSecret (operator-supplied)
  - server.tls.certManager.enabled (cert-manager integration)
  - server.tls.selfSigned.enabled (eval only — not for production)
- server-certificate.yaml template for cert-manager mode
- helm install without a TLS source fails at template render with
  a pointer to docs/tls.md

CI
- .github/workflows/ci.yml Helm Chart Validation step renders the
  chart in both existingSecret and cert-manager modes, plus an
  inverse guard-regression test that asserts helm template MUST
  refuse to render when no TLS source is configured. Previously
  the single `helm template` invocation hit the certctl.tls.required
  fail-loud guard and exit-1'd CI. Four invocations now: lint
  (existingSecret), template (existingSecret), template
  (cert-manager), template (no args — must fail).

Integration tests
- deploy/test/integration_test.go stands up the Compose stack over
  HTTPS, extracts the CA bundle, and exercises every certctl API
  over https://localhost:8443
- All 34 integration subtests green (per Phase 8 local CI-parity)

Documentation
- New: docs/tls.md (provisioning patterns, rotation, SIGHUP reload)
- New: docs/upgrade-to-tls.md (one-step cutover, no-downgrade
  warnings, fleet-roll sequencing)
- CHANGELOG.md: v2.2.0 "HTTPS Everywhere — The Irony" entry
  (file heading unchanged; release tag is v2.0.47)
- All curls in docs/, examples/, deploy/helm/ guides use
  https://localhost:8443 --cacert

Verification
- grep -rn "ListenAndServe[^T]" cmd/ internal/ → 0 hits
- grep -rn "\"http://" cmd/ internal/ → 2 benign hits (Caddy admin
  API default, SSRF doc comment) — zero certctl endpoints
- Tasks #197–#206 (Phases 0–8) all closed in the tracker

Files: 65 changed, 3489 insertions, 372 deletions (pre-CI-fix).
2026-04-20 03:43:10 +00:00
shankar0123 c6efa4ab39 docs: add Docker Compose environments guide and fix compose files
- New deploy/ENVIRONMENTS.md: comprehensive walkthrough of all 4 compose
  files with service-by-service explanations, beginner-friendly Docker
  concepts, and expert-level networking/config details
- Fix docker-compose.dev.yml: agent LOG_LEVEL → CERTCTL_LOG_LEVEL (was
  silently ignored without the CERTCTL_ prefix)
- Add CERTCTL_CONFIG_ENCRYPTION_KEY to base and test compose (enables
  M34/M35 dynamic issuer/target config encryption)
- Add CERTCTL_DISCOVERY_DIRS to base compose agent (enables filesystem
  certificate discovery in default deployment)
- Cross-link ENVIRONMENTS.md from README doc table and quickstart.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-05 21:57:17 -04:00