24 Commits

Author SHA1 Message Date
shankar0123 569aea255f fix(helm): servicemonitor.yaml — Go templates don't support nested comments (B3 ci-guard)
c70bb07 was incomplete. Replacing the YAML `#` comment block with a
Helm `{{- /* ... */ -}}` comment block was correct, but the NOTE
section I added explaining the syntax contained the literal
characters `*/ -}}` (it described the comment-syntax in prose).

Go templates DO NOT support nested comments. The lexer scans forward
from `{{- /*` looking for the FIRST `*/}}` or `*/ -}}` token and
treats whatever it finds as the comment terminator. So the literal
`*/ -}}` sequence inside my explanatory NOTE closed the comment
early, exposing the trailing narrative (which contained `{{ ... }}`
as descriptive text about template actions) as live YAML. Helm's
template engine then parsed `{{ ... }}` literal text as a real
template action whose body is `...` — `unexpected <.> in operand`
at servicemonitor.yaml:26.

Verified locally with helm 3.16.0 + the B3-helm-chart-coherence
ci-guard:
  B3-helm-chart-coherence: clean (default + external-Postgres +
  cert-manager + production hardening + 3 fail-fast gates +
  DEPL-003 viaHook env render all green).

Fix: rewrote the NOTE without the literal closing-syntax `*/ -}}`
characters and without the `{{ ... }}` action-delimiter examples.
The narrative now points operators at docs/operator/helm-deployment.md
for the full explanation rather than inlining template-action examples
into the chart-template comment block.

Lesson update: descriptive references to Helm template actions inside
chart templates must live in Helm-comment blocks (correct) AND those
comment blocks must not contain the literal closing-delimiter sequence
`*/ -}}` as text (also correct). When in doubt, narrate the rule from
the operator-facing doc, don't inline syntax examples in chart-template
comments.
2026-05-16 22:48:47 +00:00
shankar0123 c70bb071f9 fix(helm): DEPL-004 follow-up — Helm-comment block for tlsConfig narrative (B3 ci-guard)
Commit 9155ec9 introduced a YAML `#` comment block above the
tlsConfig branch that referenced `{{ if ... }}` and `{{ fail }}`
as literal text. Helm's template engine scans for `{{ ... }}`
action delimiters everywhere in the source — it does NOT respect
YAML `#` comments. So Helm parsed the multi-line sequence

    {{ if .Values.monitoring.
    # serviceMonitor.tlsConfig }}

as a single template action containing an invalid `#` token,
which aborted the WHOLE chart render with:

  Error: parse error at (certctl/templates/servicemonitor.yaml:51):
  unexpected <.> in operand

That's why all five B3-helm-chart-coherence render modes (default,
external-Postgres, production-hardening, sessionAffinity, viaHook)
failed simultaneously on f7fcd1e — the parse error fires before
any mode-specific values get applied.

Fix: replace the YAML `#` block with a Helm `{{- /* ... */ -}}`
comment block. Helm strips the comment body before template
execution, so descriptive references to `{{ if ... }}` /
`{{ fail }}` inside the comment are safe. Also rewrote the
`{{ fail }}` message string to drop the inline backtick-quoted
`{ insecureSkipVerify: true }` shape (literal `{` could have
re-tripped the same scanner) in favor of `insecureSkipVerify=true`.

Lesson: descriptive references to Helm template actions inside
chart templates MUST live in Helm-comment blocks, never in YAML
comments. The G-3-env-docs-drift fix in f7fcd1e is unaffected —
this is purely the B3-helm-chart-coherence regression introduced
by 9155ec9.
2026-05-16 22:29:56 +00:00
shankar0123 9155ec9174 fix(helm): DEPL-004 follow-up — default tlsConfig to real verify; fix ill-formed required-nil
Sprint 6 ACQ DEPL-004 closure follow-up. CI run on commit 58a15e0
caught two issues:

1. The fail-closed guard in templates/servicemonitor.yaml used
   `{{ required "msg" nil }}`, which is wrong Helm syntax — the
   bareword `nil` isn't valid in Go templates and Helm interprets
   it as no value, hitting "wrong number of args for required:
   want 2 got 0". The B3-helm-chart-coherence ci-guard's
   production-hardening render
   (`--set monitoring.serviceMonitor.enabled=true` without
   explicit tlsConfig) failed with this error AND with the
   downstream "missing kind: ServiceMonitor / PodDisruptionBudget /
   NetworkPolicy" cascades (the entire render aborted before
   producing the matrix).

2. The original DEPL-004 framing — "operators MUST explicitly
   choose tlsConfig or you get a chart-render error" — was the
   right intent but the wrong default. The chart's existingSecret
   integration mounts the CA bundle at a canonical path
   (/etc/prometheus/secrets/certctl-ca/ca.crt); defaulting to that
   path closes the implicit-skipVerify gap without forcing every
   operator to repeat the same boilerplate.

Fixes
=====

deploy/helm/certctl/values.yaml — flips
monitoring.serviceMonitor.tlsConfig from commented-out (which fell
through to implicit insecureSkipVerify: true) to a real verify
default:

  tlsConfig:
    caFile: /etc/prometheus/secrets/certctl-ca/ca.crt
    serverName: certctl-server

Operators with a different CA mount path override caFile;
operators who genuinely want skipVerify back must set
`{ insecureSkipVerify: true }` explicitly. Operators who blank
tlsConfig entirely (`tlsConfig: null` or `tlsConfig: {}`) still
trip the fail-closed guard.

deploy/helm/certctl/templates/servicemonitor.yaml — replaces
`required "msg" nil` with `fail "msg"`. The `fail` builtin is
the correct Helm pattern for an unconditional render-time error;
`required` is for "this value MUST be non-empty" which is the
wrong semantic here (we want to fail when the operator went OUT OF
THEIR WAY to blank the default). Failure message updated to
reflect the new default + the operator-action recipes.

docs/operator/helm-deployment.md — rewrites the
"2026-05-16 — ServiceMonitor TLS default flipped" subsection to
match the new default-on-real-verify semantics. The three operator
recipes (default install / different CA mount / explicit
skipVerify) are documented; the explicit "there is no way to
inherit pre-2026-05-16 implicit-skipVerify behavior silently"
guarantee is preserved.

Verified locally: python3 YAML parse on values.yaml clean; the
helm-templates-lint and B3-helm-chart-coherence ci-guards require
helm itself which isn't in the sandbox — both should pass on the
CI re-run.
2026-05-16 22:09:42 +00:00
shankar0123 d7546aedca fix(helm): DEPL-004 — ServiceMonitor TLS default flipped to fail-closed
Acquisition-audit DEPL-004 closure (Sprint 6 ACQ, 2026-05-16).

Pre-2026-05-16, monitoring.serviceMonitor.tlsConfig in values.yaml
was empty by default, and the ServiceMonitor template fell through
to an implicit `insecureSkipVerify: true` else-branch. Operators
opting into the ServiceMonitor (monitoring.serviceMonitor.enabled=true)
got no Prometheus TLS verification by default — in-cluster scrapes
tolerate this, out-of-cluster scrapes silently skip the chain check.

The template now emits a fail-closed `{{ required ... }}` message
at `helm template` / `helm upgrade` time if neither a real verify
nor an explicit opt-back is supplied. The error string lists both
escape hatches and the docs cross-link, so the operator sees the
fix in the same line they hit the error.

Operators with monitoring.serviceMonitor.enabled=false (the chart
default): no action required — the template short-circuits before
the tlsConfig block. Operators who had ServiceMonitor on with no
tlsConfig set: helm upgrade will fail until they supply either
{ caFile: ..., serverName: ... } (production-shaped) or
{ insecureSkipVerify: true } (operator-acknowledged opt-back).

Files
=====
- deploy/helm/certctl/templates/servicemonitor.yaml: replace the
  else-branch insecureSkipVerify default with a {{ required ... }}
  Helm builtin that fails the render with a clear remediation
  message pointing at both escape hatches and docs/operator/
  helm-deployment.md
- deploy/helm/certctl/values.yaml: rewrite the tlsConfig comment
  block to document the new fail-closed posture + both upgrade
  paths (production verify vs operator-acknowledged opt-back)
- docs/operator/helm-deployment.md: new "2026-05-16 — ServiceMonitor
  TLS default flipped (DEPL-004)" subsection in the existing
  Upgrade section with the two operator-action recipes
2026-05-16 19:44:48 +00:00
shankar0123 6a640ac3e7 fix(helm): DEPL-003 + DEPL-006 — render viaHook env, sessionAffinity, HA backend default
Sprint 3 unified-master-audit closure — two Helm-chart correctness
defects with overlapping CI-guard surface.

DEPL-003 — CERTCTL_MIGRATIONS_VIA_HOOK never rendered:
  Pre-fix the env var was documented in values.yaml and the
  migration-job.yaml comment but never made it into the server
  Deployment env block. With migrations.viaHook=true the operator's
  intent is 'the pre-install/pre-upgrade Helm Job owns migrations,'
  but the server pods, missing the env, ran their own
  cmd/server/migrations.go::runBootMigrations alongside the hook
  Job, racing on the schema lock.
  Fix: render '- name: CERTCTL_MIGRATIONS_VIA_HOOK / value: true'
  in server-deployment.yaml under '{{- if .Values.migrations.viaHook }}'.

DEPL-006 — HA example missing rate-limit backend + sessionAffinity:
  values-prod-ha.yaml sets replicas:3 but inherited the chart-wide
  default rateLimiting.backend=memory (which gives each pod its
  own bucket map, effectively tripling the cap on a 3-replica fleet)
  AND the chart had no render path for server.service.sessionAffinity
  even though docs/operator/runbooks/ha.md instructed operators to
  set it for ClientIP-routed sticky sessions.
  Fix:
    - server-service.yaml gains a conditional sessionAffinity +
      sessionAffinityConfig.clientIP.timeoutSeconds render.
    - values.yaml grows the matching schema entries (default empty
      so single-replica deploys are unaffected).
    - values-prod-ha.yaml flips rateLimiting.backend=postgres and
      service.sessionAffinity=ClientIP.
    - NOTES.txt emits a loud warning when replicas>1 + either toggle
      is still in the default state, so the misconfig surfaces at
      helm install time instead of in a confused login-flow bug
      report a week later.

CI:
  scripts/ci-guards/B3-helm-chart-coherence.sh gains 'Check 7'
  (DEPL-003 viaHook env render — both positive and negative —
  the inverse case catches future drift that drops the {{- if }}
  guard) and 'Check 8' (DEPL-006 sessionAffinity render). Both
  helm-template through to assert the rendered YAML carries the
  expected text.

Closes DEPL-003, DEPL-006.
2026-05-16 04:30:37 +00:00
shankar0123 a41fc2d75c feat(ratelimit): Phase 13 Sprint 13.3 — wire backend selector + scheduler janitor + docs + helm (ARCH-M1 closure complete)
Phase 13 Sprint 13.3 — the completion half of the ARCH-M1
substantive close. Sprint 13.2 shipped the Postgres-backed
sliding-window limiter + multi-replica integration test; Sprint 13.3
wires the 6 call sites in cmd/server/main.go through the operator-
chosen backend selector, adds the rate_limit_buckets scheduler
janitor sweep, rewrites the observability doc, exposes the env-var
in the helm chart, and promotes the multi-replica integration test
to a required CI status check.

Signature ground-truth (sprint 13.2 + 13.3)
===========================================
Prompt-template signatures: `Allow(key string) error` and "5 call
sites." Actual repo: `Allow(key string, now time.Time) error` and 6
NewSlidingWindowLimiter call sites in cmd/server/main.go (the prompt
miscounted the second EST per-principal arm). Per CLAUDE.md "the repo
is truth," matched the live shape.

What changed
============

internal/config/server.go (+40 LOC):
  - Added `SlidingWindowBackend string` + `SlidingWindowJanitorInterval
    time.Duration` to RateLimitConfig with full operator-facing
    documentation of the two valid values (memory|postgres) +
    when-to-use-which decision tree.

internal/config/config.go (+27 LOC):
  - Load() reads CERTCTL_RATE_LIMIT_BACKEND (default "memory") +
    CERTCTL_RATE_LIMIT_JANITOR_INTERVAL (default 5m).
  - Validate() rejects anything other than ""/"memory"/"postgres"
    (empty = memory equivalence for test-built Configs that bypass
    Load()). Janitor interval must be ≥ 1 minute when set.
  - Failure modes return clear ::error:: with the env-var name + the
    valid values, so an operator typo ("postgress" → memory in a
    3-replica cluster) fails fast at startup.

internal/ratelimit/factory.go (NEW, 67 LOC):
  - NewLimiter(backend, db, maxN, window, mapCap) Limiter — single
    factory the 6 cmd/server/main.go call sites route through.
  - Drop-in signature: same maxN/window/mapCap as
    NewSlidingWindowLimiter (mapCap accepted + ignored for postgres
    — the rate_limit_buckets table grows until the janitor sweeps).
  - Defensive panic on unknown backend (config.Validate is SoT;
    this is belt-and-suspenders).

internal/ratelimit/postgres_gc.go (NEW, 73 LOC):
  - PostgresGC struct + NewPostgresGC + GarbageCollect.
  - Single-statement DELETE FROM rate_limit_buckets WHERE
    updated_at < NOW() - maxWindow. Idempotent.
  - maxWindow <= 0 is a no-op (operator opt-out).

internal/scheduler/scheduler.go (+90 LOC):
  - New RateLimitGarbageCollector interface (mirrors the
    ACMEGarbageCollector / SessionGarbageCollector contracts).
  - rateLimitGC field + rateLimitGCInterval + rateLimitGCRunning
    on Scheduler.
  - SetRateLimitGarbageCollector(gc) + SetRateLimitGCInterval(d)
    Setters following the existing acmeGC/sessionGC pattern.
  - rateLimitGCLoop() — JitteredTicker + atomic.Bool guard +
    per-tick context.WithTimeout(1m). Logs row count at Debug.
  - Loop counted in the Start() WaitGroup only when the GC is
    non-nil; cmd/server/main.go skips SetRateLimitGarbageCollector
    when backend=memory so the loop never launches for that case.

cmd/server/main.go (35 LOC diff):
  - All 6 ratelimit.NewSlidingWindowLimiter call sites now route
    through ratelimit.NewLimiter(cfg.RateLimit.SlidingWindowBackend,
    db, ...). Grep verification post-fix returns ZERO hits.
  - Six sites: breakglass loginLimiter (580), ocspLimiter (1003),
    exportLimiter (1068), EST failed-basic (1535), EST per-principal
    SCEP-mTLS arm (1591), EST per-principal SCEP arm (1613). The
    intune.NewPerDeviceRateLimiter site at line 1823 stays unmoved
    — its inner type-alias wrapper is the prompt's
    out-of-scope (cmd/server/*.go only).
  - Conditionally constructs PostgresGC + wires the scheduler janitor
    when backend=postgres; logs the wiring decision either way so
    operators see "rate-limit GC sweep enabled (postgres backend)"
    or "in-memory backend self-prunes" in the boot log.

internal/api/handler/{est,export,certificates,auth_breakglass}.go:
  - Replaced 5 *ratelimit.SlidingWindowLimiter field/Setter types
    with ratelimit.Limiter (the interface). Allow() satisfies the
    same call shape on both backends; the in-memory tests that
    construct *SlidingWindowLimiter still compile because the
    concrete type satisfies the interface (compile-time check in
    internal/ratelimit/limiter.go pins this).

docs/operator/observability.md (176 LOC diff):
  - Replaced the "per-process, in-memory, reset-on-restart, not
    shared across replicas" paragraph with the new
    configurable-backend section: operator decision tree,
    backend internals (memory vs postgres), janitor description,
    falsifiable closure proof (the Sprint 13.2 integration test
    name + invocation), helm chart wiring example.
  - Updated inventory to reflect the actual handler file paths +
    actual cap configurations (the prior doc said "60s window" for
    several limiters that actually use 60m / 24h windows).
  - Doc smoke confirmed: grep -c 'per-process, in-memory,
    reset-on-restart' docs/operator/observability.md = 0.

deploy/helm/certctl/values.yaml + templates/server-configmap.yaml +
templates/server-deployment.yaml:
  - Exposed server.rateLimiting.backend (default "memory") +
    server.rateLimiting.janitorInterval (default "5m") under the
    existing rateLimiting block.
  - ConfigMap renders both as rate-limit-backend +
    rate-limit-janitor-interval keys.
  - Deployment wires CERTCTL_RATE_LIMIT_BACKEND +
    CERTCTL_RATE_LIMIT_JANITOR_INTERVAL env vars from the configmap.
  - Helm render: `helm template deploy/helm/certctl --set
    server.rateLimiting.backend=postgres` shows the env-var on the
    server-deployment.yaml output.

.github/workflows/ci.yml (+12 LOC):
  - Added a new step in the Go Build & Test job that runs the
    Sprint 13.2 multi-replica integration test
    (TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas) with
    -tags=integration -race -timeout=300s. Fails the CI status check
    if the cross-replica row lock ever stops arbitrating across
    replicas — the ARCH-M1 closure regression gate.

Verification (all green locally; postgres integration via CI)
============================================================

  $ grep -nE 'NewSlidingWindowLimiter' cmd/server/*.go
    (zero hits — Sprint 13.3 receipt)

  $ go test -short -count=1 \
      ./internal/config/... ./internal/ratelimit/... \
      ./internal/scheduler/... ./internal/api/handler/... \
      ./cmd/server/...
    ok  internal/config       1.177s
    ok  internal/ratelimit    0.007s
    ok  internal/scheduler    9.165s
    ok  internal/api/handler  6.245s
    ok  cmd/server            0.390s

  $ staticcheck ./internal/ratelimit/... ./internal/scheduler/... \
      ./internal/config/... ./internal/api/handler/... ./cmd/server/...
    (clean)

  $ gofmt -l internal/ cmd/server/
    (clean)

  $ grep -c 'per-process, in-memory, reset-on-restart' \
      docs/operator/observability.md
    0   (doc smoke — the audit's verbatim phrasing is gone)

  $ bash scripts/ci-guards/G-3-env-docs-drift.sh
    G-3 env-docs-drift: clean.

  $ bash scripts/ci-guards/complete-path-config-coverage.sh
    OK — every CERTCTL_* env var (197) has at least one non-config-
    package consumer.

Selector contract verified — config.Validate() rejects any value
other than ""/memory/postgres at startup with a clear error message.

Sprint 13.4 next (ARCH-H1 OpenAPI authoring batch 1) is on a
different axis; ARCH-M1 closure is complete with this commit
modulo the Sprint 13.7 audit-HTML flip + zero-floor pin.

Closes: ARCH-M1 substantive remediation. The cross-replica rate-
limit-cap-enforcement gap that the audit recommended deferring to
v3 is closed; operators with server.replicas > 1 flip
CERTCTL_RATE_LIMIT_BACKEND=postgres and get exactly-cap enforcement
across the cluster (proved by the multi-replica integration test now
gating CI).
2026-05-14 11:52:13 +00:00
shankar0123 8191b1ee64 scheduler+db: close Phase 6 — scale hardening across pool, jitter, ETag, asyncpoll
Phase 6 of the certctl architecture diligence remediation. Five
findings across the same scheduler-and-DB-pool surface.

SCALE-M1 (Med) — DB pool default bumped 25 → 50
  internal/config/config.go line 1972:
    MaxConnections: getEnvInt("CERTCTL_DATABASE_MAX_CONNS", 50)
  Postgres default max_connections is 100; 50 leaves headroom for
  pg_dump + ad-hoc psql + a server replica without exhausting the
  DB-side cap. Operator override env var unchanged. Operator-tune
  ladder for larger fleets (5K / 50K certs) lives in
  docs/operator/scale.md as starter values pending Phase 8 load
  tests — explicitly marked TBD.

SCALE-M3 (Med) — async-CA poll budget operator-configurable
  Live state was partially-already-shipped: all 4 async-CA
  connectors (digicert, entrust, globalsign, sectigo) already have
  per-connector CERTCTL_<NAME>_POLL_MAX_WAIT_SECONDS (Audit fix #5
  closed pre-Phase-6). What was missing: a global package-default
  override. Shipped:
    - internal/connector/issuer/asyncpoll/asyncpoll.go gains
      SetDefaultMaxWait(d) + effectiveDefaultMaxWait var + the
      currentDefaultMaxWait() priority resolver.
    - cmd/server/main.go reads CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS
      at boot and calls SetDefaultMaxWait.
    - deploy/ENVIRONMENTS.md documents the new env var (G-3 guard
      green).
  Naming deviation from the prompt's CERTCTL_ASYNC_POLL_MAX_ATTEMPTS:
  the live code tracks wall-clock time (MaxWait), not attempt count.
  Matched the existing per-connector nomenclature (_POLL_MAX_WAIT_SECONDS)
  so the priority chain reads naturally.

SCALE-M5 (Med) — JitteredTicker wrapper for all 15 scheduler loops
  internal/scheduler/jitter.go ships NewJitteredTicker(interval,
  jitterPct) + DefaultSchedulerJitter (±10%). All 15 sites in
  internal/scheduler/scheduler.go migrated from bare time.NewTicker
  to NewJitteredTicker(interval, DefaultSchedulerJitter). Base
  intervals unchanged; only the per-tick envelope adds ±10%
  randomized delay so multiple loops with the same nominal cadence
  don't co-fire and spike CPU + DB at wall-clock boundaries.

  internal/scheduler/jitter_test.go pins:
    - Bounded envelope (each tick within ±jitterPct of interval)
    - Mean drift < 30% of nominal (sign-bug detector)
    - Stop() releases the goroutine + closes C
    - Stop() idempotent (no panic on repeat)
    - Zero-jitter behaves like time.NewTicker
    - Negative and >=1 jitterPct values clamped defensively

  CI guard scripts/ci-guards/no-bare-newticker-in-scheduler.sh blocks
  any future bare time.NewTicker in scheduler.go.

SCALE-L1 (Low) — renewal-sweep semaphore behavior documented
  docs/operator/scale.md "Scheduler tick budgets" section explains
  the per-tick concurrency semaphore (CERTCTL_RENEWAL_CONCURRENCY=25
  default), the ctx-cancellation drain on tick-budget overrun, and
  operator tuning advice (raise concurrency + DB pool together).
  No code change — the behavior is defensible as-is per the audit.

SCALE-L2 (Low) — ETag middleware for top-5 read endpoints
  internal/api/middleware/etag.go computes SHA-256 ETag over the
  buffered response body, respects If-None-Match, short-circuits
  to 304 Not Modified on match. GET/HEAD only; non-2xx responses
  pass through unchanged. 64 KiB buffer cap degrades gracefully on
  oversized responses (no caching, body still flushes intact).

  Wired around the top-5 read endpoints via etagged() helper in
  internal/api/router/router.go:
    GET /api/v1/certificates
    GET /api/v1/agents
    GET /api/v1/jobs
    GET /api/v1/audit
    GET /api/v1/discovered-certificates

  internal/api/middleware/etag_test.go pins 11 behaviors including
  304-on-repeat, 200-after-mutation-with-new-ETag, POST bypass,
  4xx/5xx pass-through, oversized-response degradation, wildcard
  match, HEAD-treated-like-GET, byte-equal pass-through.

Cross-cutting fixes:
  - internal/config/config_test.go::TestLoad_DefaultValues updated
    to assert the new 50 default (was 25).
  - deploy/helm/certctl/values.yaml comment corrected — agent
    pollInterval is hardcoded 30s, not env-configurable; the
    Phase 4 comment mistakenly referenced CERTCTL_AGENT_POLL_INTERVAL
    which G-3 caught as a phantom env var.
  - asyncpoll.go reformatted by gofmt; functionally unchanged.

Verification (all pass):
  grep -nE 'SetMaxOpenConns' internal/repository/postgres/db.go    # finds 1 site
  grep -nE 'CERTCTL_DATABASE_MAX_CONNS.*50' internal/config/config.go  # config default is 50
  grep -rnE 'CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS' internal/ deploy/ENVIRONMENTS.md  # wired
  grep -cE 'time\.NewTicker\(' internal/scheduler/scheduler.go    # 0 (all migrated)
  grep -cE 'JitteredTicker' internal/scheduler/scheduler.go         # 15
  ls internal/scheduler/jitter.go internal/api/middleware/etag.go   # both exist
  ls docs/operator/scale.md                                          # exists
  bash scripts/ci-guards/no-bare-newticker-in-scheduler.sh          # clean
  bash scripts/ci-guards/G-3-env-docs-drift.sh                      # clean
  go test ./internal/scheduler/ ./internal/api/middleware/ \
    ./internal/connector/issuer/asyncpoll/ ./internal/config/       # 4/4 packages green

Closes: cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M1
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M3
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M5
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L1
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L2
2026-05-14 01:23:03 +00:00
shankar0123 d6f4d5c5e8 deploy(helm): close Phase 4 — chart surface + DR + ops runbooks
Phase 4 of the certctl architecture diligence remediation closure.
Seven findings, all in deploy/helm/certctl/.

DEPL-H2 (High) — ship deploy/helm/certctl/templates/backup-cronjob.yaml
  Operator opt-in via backup.enabled=true. Default OFF. CronJob runs
  pg_dump --format=custom --no-owner --no-acl --dbname=certctl
  matching the canonical shape in
  docs/operator/runbooks/postgres-backup.md (so manual and
  automated dumps are byte-identical). Sink: PVC (default) OR S3
  via aws-cli. Documented as in-cluster-Postgres only — managed DB
  deployments rely on their provider's PITR.

DEPL-M1 (Med) — Helm pre-install/pre-upgrade migration hook
  deploy/helm/certctl/templates/migration-job.yaml — runs
  `certctl-server --migrate-only` before the server Deployment
  rolls. The --migrate-only flag (new in cmd/server/main.go) is a
  hermetic schema-mutation pass: load config, open DB pool, run
  RunMigrations + RunSeed, exit 0. No HTTP listener, no scheduler,
  no signing setup.

  Server's boot-time RunMigrations call is now gated on
  CERTCTL_MIGRATIONS_VIA_HOOK — when set true, the server skips
  the boot path (the hook owns the work). Default still runs at
  boot, so Compose / VM / bare-metal deploys are unchanged.

  migrations.viaHook: false in values.yaml (off by default).

DEPL-M4 (Med) — explicit Postgres StatefulSet strategy fields
  deploy/helm/certctl/templates/postgres-statefulset.yaml adds:
    spec.updateStrategy.type: OnDelete
    spec.podManagementPolicy: OrderedReady
  Operator-controlled Postgres upgrades (the OnDelete strategy
  means a chart template tweak no longer triggers an immediate
  Postgres restart). OrderedReady aligns with the standard
  Postgres-on-Kubernetes pattern for any future HA work.

DEPL-M5 (Med) — per-fleet-size resource ladder documentation
  deploy/helm/certctl/values.yaml — extended comments next to
  server.resources + agent.resources documenting:
    "≤ 500 certs / 100 agents" → defaults are validated
    "5K certs / 1K agents" → starter suggestions, TBD Phase 8
    "50K certs / 10K agents" → starter suggestions, TBD Phase 8
  Numbers for the small-fleet case derive from the measured
  baselines in docs/operator/performance-baselines.md
  (50ms p50, < 3s for 1000-cert inventory walk, etc.). Larger
  fleet numbers explicitly marked TBD pending Phase 8 load-test
  runs — operators tune empirically until then.

DEPL-L1 (Low) — Helm rollback runbook
  docs/operator/runbooks/rollback.md — covers helm rollback
  mechanics, the schema-migration manual-cleanup path (when
  *.down.sql files apply vs. when full restore is the only safe
  path), and the per-migration-class safe-to-rollback table.

DEPL-L2 (Low) — Prometheus AlertManager rules
  deploy/helm/certctl/templates/prometheusrules.yaml — opt-in via
  monitoring.prometheusRules.enabled=true. Default OFF. Four
  starter rules using verified metric names from
  internal/api/handler/metrics.go:
    CertctlCertificateExpiringSoon (certctl_certificate_expiring_soon)
    CertctlAgentOffline ((agent_total - agent_online) > 0 for 1h)
    CertctlJobFailureRateHigh (failure rate over 5% for 15m)
    CertctlIssuanceFailures (any failures over 15m window)
  All thresholds operator-tunable via
  monitoring.prometheusRules.thresholds.* in values.

DEPL-L3 (Low) — Prometheus bearer-token setup runbook
  docs/operator/runbooks/prometheus-bearer-token.md — documents
  the API-key + Secret + values wiring for the RBAC-gated
  /api/v1/metrics/prometheus scrape endpoint. End-to-end
  procedure with troubleshooting steps + rotation guide.

CI guard: scripts/ci-guards/helm-templates-lint.sh
  Six-combo matrix: defaults / backup PVC / backup S3 /
  prometheusRules / migrations.viaHook / all-on. Each runs helm
  template + checks render success. helm lint also gated.
  Wired into the auto-pickup loop in .github/workflows/ci.yml;
  azure/setup-helm@b9e51907 (v4.3.0, SHA-pinned per Phase 1
  RED-2) installs helm v3.16.0 on the runner.

Verification (all pass):
  ls deploy/helm/certctl/templates/{backup-cronjob,migration-job,prometheusrules}.yaml
  grep -E 'updateStrategy|podManagementPolicy' deploy/helm/certctl/templates/postgres-statefulset.yaml  # 2 matches
  helm template deploy/helm/certctl/ --set backup.enabled=true \
    --set monitoring.prometheusRules.enabled=true --set migrations.viaHook=true \
    | grep -E "kind: (CronJob|PrometheusRule|Job)"  # 3 matches
  helm lint deploy/helm/certctl/  # 0 failed
  ls docs/operator/runbooks/{rollback,prometheus-bearer-token}.md
  bash scripts/ci-guards/helm-templates-lint.sh  # 6/6 matrix combinations pass

Go build clean (cmd/server compiles, migrate-only path verified by
the build target). YAML validated.

Closes: cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H2
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M1
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M4
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M5
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L1
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L2
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L3
2026-05-14 00:58:00 +00:00
shankar0123 69a2b5c55a config: default hardening + operator docs (Phase 2 closure — SEC-H1, SEC-H3, SEC-M4, DEPL-H1, DEPL-M2 + doc-only carve-outs)
Eleven findings from the architecture diligence audit's Phase 2 bundle
closed in one PR. All touch the same backend config + Helm chart +
operator docs surface, so reviewing in one diff is the natural fit.

config.go: three new fail-closed Validate() branches behind sentinels
=====================================================================

Three new error sentinels exported from internal/config/config.go for
tests to pin via errors.Is + message-text:
  - ErrAgentBootstrapTokenRequired (SEC-H1)
  - ErrACMEInsecureWithoutAck      (SEC-M4)
  - ErrDemoModeAckExpired          (SEC-H3)

SEC-H1 (staged): introduces CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY
as an opt-in feature flag. When true AND the bootstrap token is empty,
Validate() returns ErrAgentBootstrapTokenRequired and the server
refuses to start. Default in THIS release: false (warn-mode
pass-through preserved). WORKSPACE-ROADMAP.md schedules the default
flip to true for v2.2.0 — operators get one upgrade window.

SEC-M4: upgrades the existing boot-time WARN log for
CERTCTL_ACME_INSECURE=true into a hard refuse-to-start gate behind
CERTCTL_ACME_INSECURE_ACK=true. The ACK env var must be paired with
the existing INSECURE flag; either alone fails closed. The boot-time
WARN log at cmd/server/main.go:611 continues to fire for the ACK'd
case so every restart logs the reminder.

SEC-H3: tightens the sticky DemoModeAck bit so it expires after 24h.
When DemoModeAck=true, Validate() now requires CERTCTL_DEMO_MODE_ACK_TS
to be set as a unix-epoch timestamp within the last 24h (24h-tolerance
on the past side, 1-minute clock-skew on the future side). Catches the
"forgotten demo deployment promoted to production" failure mode —
next container restart past 24h refuses unless re-ack'd.

Tests in internal/config/config_test.go cover every new branch:
positive (passes when properly set), negative (each fail-closed path
fires with the matching sentinel + message-text). 11 new tests added.

Helm chart + HA runbook (DEPL-H1)
=================================

Created docs/operator/runbooks/ha.md documenting the three values
flips required for production HA: server.replicas, podDisruptionBudget,
service.sessionAffinity. Cross-link comments added to
deploy/helm/certctl/values.yaml next to the server.replicas (line 19)
and podDisruptionBudget (line 566) defaults. DEFAULTS DO NOT CHANGE
— that's the point per the prompt's 'do not flip networkPolicy default'
guidance: a default-enabled PDB blocks fresh helm install on
single-node clusters.

CI guard (DEPL-M2)
==================

scripts/ci-guards/no-change-me-in-prod-compose.sh grep-fails any
'change-me-' literal in compose files OTHER than docker-compose.demo.yml.
Catches the placeholder-credential-leak regression one layer earlier
than the runtime Validate() fail-closed guards from Bundle 2 (2026-05-12).
Excludes comment lines so docs explaining the pattern don't trip the
guard. Verified to fire on a synthetic leak; clean on the current tree.

Consolidated 'Security carve-outs' doc section
==============================================

docs/operator/security.md grows by one new section documenting the
seven existing carve-outs in one canonical place:
  - SEC-M3: 3 InsecureSkipVerify=true sites (Agent dev, verify probe, tlsprobe)
  - SEC-M5: F5 connector InsecureSkipVerify per-config field
  - SEC-M4: ACME insecure + new ACK gate
  - SEC-L1: CSP 'unsafe-inline' on style-src (Tailwind carve-out)
  - SEC-L2: break-glass Argon2id rest-defense reminder
  - SEC-L3: 1 MB body-size cap + CERTCTL_MAX_BODY_SIZE override
  - DEPL-M2: change-me-* placeholder credentials in demo overlay
  - DEPL-M3: K8s NetworkPolicy operator-opt-in default

Each entry cites the file:line, the rationale for the carve-out, and
the operator action.

CHANGELOG + ENVIRONMENTS coverage
==================================

CHANGELOG.md grows by one new '### Breaking changes (scheduled for
v2.2.0)' section under Unreleased, documenting SEC-H1 / SEC-M4 / SEC-H3
with explicit upgrade-window guidance for each.

deploy/ENVIRONMENTS.md adds five rows: AGENT_BOOTSTRAP_TOKEN +
AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY + DEMO_MODE_ACK + DEMO_MODE_ACK_TS +
ACME_INSECURE_ACK. G-3 env-docs-drift CI guard stays clean.

WORKSPACE-ROADMAP.md (cowork-side) schedules the SEC-H1 default-flip
for v2.2.0.

Sandbox limitation
==================

The certctl repo's working tree is 6.1 GB which fills the sandbox
volume; the go1.25.10 toolchain download (go.mod requires it,
sandbox has 1.25.9) keeps failing on disk-full. Local 'go build' /
'go test' were NOT run in this commit's verification path.
make verify MUST be run on the operator's workstation before push
per CLAUDE.md operating rules.

CI guards (no-change-me, G-3 env-docs-drift, doc-rot-detector, +
all existing) verified clean by running each individually.

Closes: cowork/certctl-architecture-diligence-audit.html#fix-SEC-H1,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-H3,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-M4,
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H1,
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M2,
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M3,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-M3,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-M5,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-L1,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-L2,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-L3
2026-05-13 19:50:00 +00:00
shankar0123 47da13e7a1 fix(helm): close BUNDLE 3 — Helm chart hardening + enterprise deploy
Bundle 3 closure (2026-05-12 acquisition diligence audit). Closes the
"chart claims production-ready but lying-fields silently break it"
hazard cluster: README install command had wrong key, required secrets
weren't fail-fast, external Postgres rendered the bundled StatefulSet
hostname, container-only security hardening fields landed at pod scope
(silently dropped by K8s API), and three advertised template surfaces
(ServiceMonitor, PodDisruptionBudget, NetworkPolicy) didn't render at
all even when their values.yaml toggles were on.

Source findings closed:
  C2 C3 D1 D2 D3 D5 D7 D11 D12       (repo audit)
  OPS-L1 OPS-L2                       (cowork audit)
Source findings explicitly deferred (tracked in WORKSPACE-ROADMAP.md):
  D6 OPS-H1   (backup automation — operator must choose target storage)
  D10         (digest pinning of latest `:latest` tags)
  OPS-M1      (prometheus/client_golang migration)
  OPS-M2      (distributed tracing instrumentation)

Chart truth table (rendered with helm 3.16.3):
  -f values.yaml + tls.existingSecret + auth.apiKey + pg.auth.password
    → 12 resources (default mode, no monitoring/PDB/networkpolicy)
  + postgresql.enabled=false + externalDatabase.url=…
    → NO StatefulSet, NO postgres-secret, NO postgres-service (D2)
  + server.tls.certManager.enabled=true
    → +1 Certificate (cert-manager mode)
  + replicas=3 + monitoring.enabled=true + serviceMonitor.enabled=true
    + podDisruptionBudget.enabled=true + networkPolicy.enabled=true
    → +1 ServiceMonitor + 1 PodDisruptionBudget + 1 NetworkPolicy (D5+D11)
  tls.existingSecret AND tls.certManager.enabled both set
    → REFUSED with "EXACTLY ONE TLS ownership path" error (D7)
  Missing required secrets (apiKey / pg password / external URL)
    → REFUSED at template time with operator-actionable guidance (D1)

Closures by source ID:

C2 — README Helm install example fixed. Was `--set postgresql.password=…`
  (does not exist); now `--set postgresql.auth.password=…` matching
  the chart key. README install block also wires TLS, mentions
  fail-fast at template time, and links the external-Postgres example.

C3 — Kubernetes Secrets connector annotated PREVIEW in values.yaml.
  The chart still exposes `kubernetesSecrets.enabled` for the RBAC
  preview wiring, but the values block now states clearly that the
  production K8s client at internal/connector/target/k8ssecret/
  k8ssecret.go::realK8sClient is a stub (verified — go.mod imports
  zero k8s.io/client-go packages). Production landing tracked in
  WORKSPACE-ROADMAP.md.

D1 — `certctl.requiredSecrets` template helper. Fail-fasts at render
  time when (a) server.auth.type=api-key + apiKey empty, (b)
  postgresql.enabled=true + pg.auth.password empty, (c)
  postgresql.enabled=false + externalDatabase.url + legacy env
  CERTCTL_DATABASE_URL all empty. Each branch emits an
  operator-actionable diagnostic with the openssl rand command or
  values override needed. postgres-secret template additionally
  uses Helm's `required` builtin so it can't render with the empty
  fallback that pre-Bundle-3 produced ("changeme" literal).

D2 — externalDatabase.url first-class. New top-level values block.
  certctl.databaseURL helper now branches on postgresql.enabled:
  bundled path uses the helper-emitted in-cluster URL; external
  path uses externalDatabase.url verbatim. postgres-secret,
  postgres-statefulset, and postgres-service ALL gate on
  postgresql.enabled — external mode renders ZERO postgres-*
  resources. POSTGRES_PASSWORD env in server-deployment also gates.

D3 — Container-vs-pod security context split. K8s API silently drops
  readOnlyRootFilesystem / allowPrivilegeEscalation / capabilities /
  privileged when they land at pod scope (`spec.securityContext`);
  they only work at container scope (`spec.containers[].securityContext`).
  Pre-Bundle-3 all fields sat at pod scope so the chart's documented
  "read-only rootfs + drop-all caps" hardening was effectively
  unenforced. New certctl.podSecurityContext + containerSecurityContext
  helpers split the operator-facing securityContext map by field-name
  whitelist so existing values keep working byte-for-byte while
  fields render at the K8s-valid scope. Applied to both
  server-deployment.yaml and agent-daemonset.yaml (DaemonSet + Deployment
  branches).

D5 — Prometheus ServiceMonitor template. New
  templates/servicemonitor.yaml. Renders when monitoring.enabled AND
  monitoring.serviceMonitor.enabled. Scrapes /api/v1/metrics/prometheus
  (rbac-gated on metrics.read — needs bearerTokenSecret with an API
  key holding that perm). values.yaml block extended with bearerTokenSecret,
  tlsConfig, and relabelings knobs and the operator-facing comment
  documenting the auth requirement.

D7 — TLS both-set rejection. certctl.tls.required helper extended.
  Pre-Bundle-3 only the NEITHER-set case was caught; setting BOTH
  rendered a dangling cert-manager Certificate alongside an
  existing-Secret mount, two conflicting TLS sources of truth.
  Now refuses with "EXACTLY ONE TLS ownership path" + remediation
  steps for both possible operator intents.

D11 — PodDisruptionBudget + NetworkPolicy templates. New
  templates/pdb.yaml (renders when podDisruptionBudget.enabled +
  server.replicas > 1) + templates/networkpolicy.yaml (renders when
  networkPolicy.enabled). PDB uses minAvailable / maxUnavailable
  exclusivity per K8s spec. NetworkPolicy default-allows in-namespace
  agent → server traffic, kube-DNS egress, and bundled-postgres
  egress (when postgresql.enabled), with operator-extensible
  extraIngress / extraEgress for CA / OIDC / SMTP egress. Both
  default off so existing deploys don't lose network reach
  unannounced.

D12 — Database max-conn config wired. Pre-Bundle-3
  internal/repository/postgres/db.go::NewDB hard-coded
  SetMaxOpenConns(25). config.go loaded CERTCTL_DATABASE_MAX_CONNS,
  Validate() enforced the >= 1 floor, values.yaml documented it,
  and docs/reference/configuration.md surfaced it — but the pool
  ignored every operator setting. New NewDBWithMaxConns threads
  the operator value into the pool with maxIdle = maxOpen / 5
  (≥ 1) so the historical ratio carries forward. cmd/server/main.go
  calls the new constructor; NewDB stays for compat at the default 25.

OPS-L1 — Chart version 0.1.0 → 1.0.0. Chart has shipped through 8 audit
  closures since 2026-02 (M-018, U-1, U-2, U-3, H-1, G-1, B1, B2);
  pre-1.0 version was implying instability the chart no longer has.

OPS-L2 — External-Postgres path is now properly documented in values.yaml
  (externalDatabase block with mode-2 example), README install command
  links the existing examples/values-external-db.yaml, and the chart
  truth table above proves the external mode renders cleanly.

Receipts:
  helm lint deploy/helm/certctl/                                # clean
  helm template c deploy/helm/certctl/ \
      --set server.tls.existingSecret=ci \
      --set postgresql.auth.password=p \
      --set server.auth.apiKey=k                                # 12 kinds, default
  helm template c deploy/helm/certctl/ \
      --set server.tls.existingSecret=ci \
      --set postgresql.enabled=false \
      --set externalDatabase.url='postgres://u:p@h:5432/db?sslmode=require' \
      --set server.auth.apiKey=k                                # 9 kinds, no postgres-*
  helm template c deploy/helm/certctl/ \
      --set server.tls.certManager.enabled=true \
      --set server.tls.certManager.issuerRef.name=letsencrypt \
      --set postgresql.auth.password=p --set server.auth.apiKey=k
                                                                # +1 Certificate (cert-manager)
  helm template c deploy/helm/certctl/ \
      --set server.tls.existingSecret=ci \
      --set postgresql.auth.password=p --set server.auth.apiKey=k \
      --set server.replicas=3 \
      --set monitoring.enabled=true \
      --set monitoring.serviceMonitor.enabled=true \
      --set podDisruptionBudget.enabled=true \
      --set networkPolicy.enabled=true                          # +ServiceMonitor +PDB +NetworkPolicy
  (TLS both-set + missing apiKey + missing pg password + missing extDb URL all REFUSED.)

  gofmt -l                                                      # clean
  go vet ./internal/repository/postgres ./cmd/server            # clean
  go build ./cmd/server                                         # clean
  bash scripts/ci-guards/B3-helm-chart-coherence.sh             # clean

Remaining operator warnings (deferred, tracked in WORKSPACE-ROADMAP.md):
  - Backup CronJob + restore script (D6 + OPS-H1): operator chooses
    target (S3, GCS, Azure Blob, NFS). Sample CronJob yaml may ship
    in deploy/helm/examples/ once an operator workstation has run
    one full backup-restore cycle.
  - Distributed tracing (OPS-M2): otel/* are go.mod indirect deps,
    not actively instrumented. Adding spans is a v3 work item.
  - Prometheus client_golang migration (OPS-M1): the hand-rolled
    /metrics/prometheus exposition format works today; client_golang
    migration unlocks histograms + exemplars + native label sets.

Audit-Closes: BUNDLE-3 C2 C3 D1 D2 D3 D5 D7 D11 D12 OPS-L1 OPS-L2
Audit-Defers: D6 D10 OPS-H1 OPS-M1 OPS-M2
2026-05-13 00:40:42 +00:00
shankar0123 2d9110b0c4 auth-bundle-2 Phase 0: dependency-add + oidc auth-type literal + runtime guard
Bundle 2 Phase 0 stages the dependencies + auth-type discriminator
literal that later phases consume. No handler chain wired yet; an
operator who sets CERTCTL_AUTH_TYPE=oidc on this commit gets a clear
refuse-to-start error rather than a silent fallback to api-key (the
G-1 failure mode that drove "jwt" out of the allowed set).

Deliverables:

* go.mod: github.com/coreos/go-oidc/v3 v3.18.0 added as a direct
  require. Per the pre-bundle dependency audit (Apache-2.0, zero CVEs
  ever per OSV.dev, 2,400+ stars, used by Hashicorp Vault + Dex +
  Hydra + Authentik + every Kubernetes OIDC integration), this is the
  ecosystem-standard Go OIDC client. Pinned to a specific minor
  (v3.18.0) per the prompt's "no bare latest" rule.
* go.mod: golang.org/x/oauth2 promoted from // indirect to direct,
  bumped from v0.34.0 to v0.36.0 by go mod tidy. Both versions are
  OSV-clean. Maintained by the Go team.
* No JSON-path library added (forbidden by the dependency audit; the
  group-claim resolver is hand-rolled in Phase 3).
* internal/config/config.go: AuthTypeOIDC constant added with a
  load-bearing comment explaining (a) this is the AUTH-TYPE literal,
  not a JWT alg literal, so the G-1 closure invariant is preserved
  ("jwt" stays out of ValidAuthTypes forever); (b) the runtime guard
  in cmd/server/main.go intentionally refuses-to-start when oidc is
  set pre-Phase-6 to avoid the silent-downgrade failure mode.
  ValidAuthTypes() now returns {api-key, none, oidc}.
* internal/config/config_test.go: TestValidAuthTypesIsExactly_APIKey_None
  renamed to TestValidAuthTypesIsExactly_APIKey_None_OIDC and now pins
  the 3-entry set. TestValidAuthTypesDoesNotContainJWT (G-1 closure
  test) still passes because "jwt" is never added back.
  TestValidate_GenericInvalidAuthType's bad-types list updated:
  "oidc" removed (now valid), "saml" added (correctly rejected per
  Decision 5's SAML deferral).
* cmd/server/main.go: defense-in-depth runtime auth-type guard now
  has an explicit AuthTypeOIDC case that exit(1)s with an actionable
  message: "the OIDC auth chain is not yet wired in this build (Auth
  Bundle 2 Phase 6 ships the session middleware that consumes this
  auth-type literal)." This closes the lying-field gap the literal
  would otherwise create. Phase 6 of Bundle 2 relaxes this case to
  fall through alongside api-key + none.
* api/openapi.yaml: /v1/auth/info auth_type enum extended from
  [api-key, none] to [api-key, none, oidc] with an in-line comment
  explaining the Phase-0-vs-Phase-6 timing so an OpenAPI consumer
  isn't surprised by "oidc" appearing here pre-Bundle-2-merge.
* deploy/helm/certctl/templates/_helpers.tpl::certctl.validateAuthType:
  valid set extended to include "oidc". Chart-time validation now
  passes for type=oidc; the binary's runtime guard takes over to
  refuse the start. Once Bundle 2 ships, the runtime guard relaxes
  and OIDC works end-to-end with no further chart edits.
* .env.example: CERTCTL_AUTH_TYPE comment block updated to document
  the three valid values + the Phase-0-vs-Phase-6 timing.
* internal/auth/oidc/doc.go: new package directory with package doc
  + transitional blank imports for coreos/go-oidc/v3 + x/oauth2 so
  go mod tidy keeps both deps as direct requires until Phase 3's
  service.go replaces the blanks with real symbol use. Doc explains
  the package layout (oidc/ + oidc/domain/ + oidc/groupclaim/ +
  oidc/testfixtures/) so the post-Bundle-2 reader can navigate.

Verifications:
* gofmt clean on every changed file.
* go vet clean on internal/config + cmd/server + internal/auth/oidc.
* go test -short -count=1 green on internal/config (including the
  G-1 closure + new validation tests), cmd/server, internal/auth (all
  Bundle 1 packages), internal/service/auth.
* govulncheck ./... clean (M-024 hard CI gate).
* All 24 ci-guards pass locally.

Phase 0 exit criteria from cowork/auth-bundle-2-prompt.md:
* go.mod shows coreos/go-oidc/v3 as direct: yes.
* golang.org/x/oauth2 is direct (not indirect): yes.
* govulncheck ./... clean: yes.
* No JSON-path library in go.mod / go.sum deltas: confirmed (only
  v3 of go-oidc + the x/oauth2 bump landed).
* make verify green: gofmt + vet + go test pass; full make verify
  (which would invoke golangci-lint) deferred to CI since the
  sandbox doesn't have golangci-lint installed; the operator runs
  make verify locally before pushing per CLAUDE.md operating rule.
2026-05-10 03:31:51 +00:00
shankar0123 0729ee46e0 chore: sweep github.com/shankar0123/certctl URL refs to certctl-io/certctl
Post-transfer cosmetic + release-critical URL refresh after moving the
repo from github.com/shankar0123/certctl to github.com/certctl-io/certctl
(2026-05-03). GitHub HTTP redirects continue to forward old URLs forever,
so existing operators are not broken — but aligns the canonical
references with the new owner so:

- procurement engineers / contributors browsing the docs see the right
  URL on first read
- operators copying the agent install one-liner hit the new path
  directly without going through a redirect
- the Helm chart's default image repository points at the canonical org
  registry path
- the OnboardingWizard rendered to first-run UI users shows the new
  URL in the install snippets and doc anchor links
- the GitHub Actions release workflow pushes container images to
  ghcr.io/certctl-io/certctl-{server,agent} (was: shankar0123)
- the release-notes Markdown body in release.yml — which gets stamped
  into every future release page — references the post-transfer
  cert-identity (cosign keyless signing now uses the certctl-io
  workflow URL) and the post-transfer SLSA provenance source-uri.
  Without this, every cosign verify / slsa-verifier command on a
  v2.1.0+ release would fail because the cert-identity-regexp would
  not match the signing identity GitHub Actions OIDC issues post-
  transfer. Old releases (v2.0.67 and earlier) keep their immutable
  release-notes pointing at the shankar0123 path and remain
  verifiable via their own published instructions.

Customer impact:
- Operators on ghcr.io/shankar0123/certctl-{server,agent}:latest
  silently freeze on whatever tag was current at transfer time. They
  get no errors; they just stop receiving updates. The next release
  notes need a one-line callout (Phase 3.1 of cowork/transfer-
  certctl-to-org.md) telling them to update their image path to
  ghcr.io/certctl-io/certctl-{server,agent}.
- All other URLs (git clone, install one-liner, raw.githubusercontent
  URLs, browser links, GitHub API) continue to resolve via permanent
  HTTP redirects. The sweep is cosmetic for those.

Files swept (30 total):
  .github/workflows/release.yml — IMAGE_NAMESPACE, source-uri,
    cosign cert-identity-regexp, IMAGE= snippet (5 refs total).
  CHANGELOG.md, README.md — anchor links, badges, install one-liner,
    cosign verify snippets in operator-facing sections.
  api/openapi.yaml — info / externalDocs URLs.
  install-agent.sh — GITHUB_REPO const + systemd unit Documentation=
    field.
  deploy/ENVIRONMENTS.md, deploy/helm/{CHART_SUMMARY,INDEX,
    INSTALLATION,README}.md, deploy/helm/certctl/{Chart.yaml,
    README.md,values.yaml}, deploy/helm/examples/values-*.yaml —
    chart docs + image repository defaults across dev / prod-ha
    overrides.
  docs/{certctl-for-cert-manager-users,connector-iis,connectors,
    migrate-from-acmesh,migrate-from-certbot,quickstart,test-env,
    why-certctl}.md — operator-facing doc URLs.
  examples/{acme-nginx,acme-wildcard-dns01,multi-issuer,
    private-ca-traefik,step-ca-haproxy}/docker-compose.yml +
    examples/step-ca-haproxy/step-ca-haproxy.md — example image:
    paths and accompanying narrative.
  web/src/pages/OnboardingWizard.tsx — first-run-UI URL refs (curl
    install one-liners, agent docker image path, doc anchor links).

Files intentionally NOT swept (Choice A from cowork/transfer-certctl-
to-org.md):
  go.mod, go.sum — module declaration stays github.com/shankar0123/
    certctl. Existing imports compile because Go uses the path
    declared in go.mod, not the URL it was fetched from. Internal-
    only project; no external Go consumers; rename will land as a
    mechanical sed when one materializes.
  ~250 *.go files — every import remains github.com/shankar0123/
    certctl/internal/...
  deploy/test/f5-mock-icontrol/go.mod — separate test sub-module;
    same Choice A logic; module path stays.

Files intentionally NOT swept (other reasons):
  README.md lines 244-245 — Scarf-pixel docker-pull commands.
    shankar0123.docker.scarf.sh/... is a Scarf-account hostname
    (per-user, not per-repo) and the pixel keeps tracking pulls
    against the operator's personal Scarf account. Migrating to a
    certctl-io Scarf account is a separate decision (create org
    Scarf account → re-create package → update README).
  deploy/test/f5-mock-icontrol/f5-mock-icontrol — checked-in
    compiled binary with shankar0123/certctl baked into Go build
    info via the sub-module path. Out of scope for a URL sweep;
    will refresh on the next `make test-integration` rebuild.

Verification:
  gofmt: clean (no .go files touched).
  go vet ./...: clean (verified at this SHA in 1.3 of the transfer
    checklist; no .go changes since).
  go build ./...: clean (same).
  go test -short on representative packages: green (same).
  Diff shape: 30 files, 74 insertions / 74 deletions, net-zero size,
    pure URL substitution.
2026-05-03 23:39:50 +00:00
shankar0123 45ba27693b Update LICENSE metadata 2026-04-26 23:29:59 +00:00
shankar0123 30f9f1e712 Bundle B: Auth & transport surface tightening — 5 findings closed
Closes M-001 + M-002 + M-013 + M-018 + M-025 from
comprehensive-audit-2026-04-25.

M-001 (CWE-916) — PBKDF2 100k -> 600k via v3 blob format
  internal/crypto/encryption.go:
    - New v3Magic (0x03), pbkdf2IterationsV3 (600,000 — OWASP 2024
      Password Storage Cheat Sheet floor), v3SaltSize (16 bytes),
      deriveKeyWithSaltV3 helper.
    - EncryptIfKeySet now unconditionally writes v3:
        magic(0x03) || salt(16) || nonce(12) || ciphertext+tag
    - DecryptIfKeySet falls through v3 -> v2 -> v1 with AEAD verification
      at each step. Wrong-passphrase v3 reads cannot be silently
      misattributed to v2/v1.
    - IsLegacyFormat updated to recognize 0x03 as non-legacy.
  internal/crypto/encryption_v3_test.go (NEW, 7 tests):
    V3 round-trip / V2 read-fallback against deterministic v2 fixture /
    V3 wrong-passphrase fails / V3-vs-V2 dispatch order / V2 vs V3 keys
    differ for same (passphrase, salt) / iteration-count pin at OWASP
    2024 floor / IsLegacyFormat-recognises-V3.
  Coverage internal/crypto: 86.7% -> 88.2%.

M-002 (CWE-862) — Auth-exempt allowlist constants + AST regression test
  Recon found auth-exempt surface spans TWO layers (audit's claim was
  incomplete):
    Layer 1 (router.go direct r.mux.Handle):
      GET /health, GET /ready, GET /api/v1/auth/info, GET /api/v1/version
    Layer 2 (cmd/server/main.go::buildFinalHandler URL-prefix dispatch):
      /.well-known/pki/*, /.well-known/est/*, /scep[/...]*
  internal/api/router/router.go:
    - New AuthExemptRouterRoutes constant with per-entry justifications.
    - New AuthExemptDispatchPrefixes constant.
  internal/api/router/auth_exempt_test.go (NEW, 2 tests):
    AST-walks router.go for every direct mux.Handle call and asserts
    set equals AuthExemptRouterRoutes; reads source bytes of Register /
    RegisterFunc and asserts they still wrap with middleware.Chain.
  cmd/server/auth_exempt_test.go (NEW, 2 tests):
    14-case table test on buildFinalHandler asserting documented
    prefixes route to noAuthHandler and authenticated routes route to
    apiHandler; inverse-overlap pin proves no documented bypass shadows
    an authenticated prefix.

M-013 (CWE-942) — CORS deny-by-default verified-already-clean + pin
  Audit claim 'default allows all origins if env-var unset' was WRONG.
  internal/api/middleware/middleware.go::NewCORS already denies cross-
  origin requests when len(cfg.AllowedOrigins) == 0 (no
  Access-Control-Allow-Origin header is emitted, same-origin policy
  applies).
  internal/api/middleware/cors_test.go: +TestNewCORS_NilOriginsDeniesAll
  + TestNewCORS_M013_ContractDocumentedInOrder (5-case table test
  pinning the 3-arm dispatch contract).

M-018 (CWE-319 / PCI-DSS Req 4) — Postgres TLS opt-in toggle
  deploy/helm/certctl/values.yaml: new postgresql.tls.{mode,caSecretRef}
    operator-facing knobs. Default 'disable' preserves in-cluster pod-
    network behavior; PCI-scoped operators set verify-full.
  deploy/helm/certctl/templates/_helpers.tpl: certctl.databaseURL helper
    pipes postgresql.tls.mode into ?sslmode=.
  deploy/helm/certctl/templates/server-secret.yaml: uses the helper
    instead of hardcoded sslmode=disable.
  deploy/docker-compose.yml: CERTCTL_DATABASE_URL is now
    ${CERTCTL_DATABASE_URL:-...} so operators override without editing.
  docs/database-tls.md (NEW): operator runbook covering 4 deployment
    shapes, RDS verify-full example with PGSSLROOTCERT mount, and
    pg_stat_ssl verification query.
  helm template + helm lint clean.

M-025 (OWASP ASVS L2 §11.2.1) — Per-key rate limiting
  internal/api/middleware/middleware.go::NewRateLimiter rewritten from
  a single global tokenBucket to a keyedRateLimiter map keyed on
    'user:'+GetUser(ctx)  for authenticated callers
    'ip:'+RemoteAddr-host for unauthenticated
  - Empty UserKey strings treated as unauthenticated.
  - X-Forwarded-For intentionally NOT consulted (header-spoofing risk).
  - Create-on-demand bucket allocation under sync.RWMutex with double-
    check pattern.
  RateLimitConfig.PerUserRPS / PerUserBurstSize fields with env vars
    CERTCTL_RATE_LIMIT_PER_USER_RPS / CERTCTL_RATE_LIMIT_PER_USER_BURST
    allow per-user budgets distinct from per-IP.
  internal/api/middleware/ratelimit_keyed_test.go (NEW, 5 tests):
    TwoIPsHaveIndependentBuckets / SameUserDifferentIPsShareBucket /
    TwoUsersHaveIndependentBuckets / PerUserBudgetOverride /
    EmptyUserKeyTreatedAsAnonymous.
  Coverage internal/api/middleware: 82.1% -> 83.7%.

Audit deliverables:
  cowork/comprehensive-audit-2026-04-25/audit-report.md: score
    25/55 -> 30/55 closed (High 7/9, Medium 7/27 -> 12/27, Low 8/19).
  cowork/comprehensive-audit-2026-04-25/findings.yaml: 5 status flips
    open -> closed with closure notes citing the Bundle B mechanism.
  certctl/CHANGELOG.md: Bundle B section under [unreleased].

Verification:
  go test -count=1 -short ./...                     all green
  staticcheck on changed packages                   no new SA*/ST* hits
    (the 4 pre-existing SA1019 sites in cmd/server/main_test.go are
    Bundle 9 / M-028 partial closure leftovers tracked in Bundle C)
  helm template + helm lint                         clean
  internal/repository/postgres setup-fail            sandbox disk pressure,
    same on master HEAD before this branch — environmental, not Bundle B
2026-04-26 23:09:10 +00:00
shankar0123 86fffa305a fix(deploy,helm,docs): published-image HEALTHCHECK speaks HTTPS + Helm /ready path + docs HTTPS sweep (U-2)
Pre-U-2 the published `ghcr.io/shankar0123/certctl-server` image
shipped with `HEALTHCHECK CMD curl -f http://localhost:8443/health`.
The server has been HTTPS-only since the v2.2 HTTPS-Everywhere milestone
(`cmd/server/main.go::ListenAndServeTLS`, no plaintext fallback, TLS
1.3 pinned), so the probe failed on every interval and Docker marked
the container `unhealthy` indefinitely. Operators inside docker-
compose / Helm / the example stacks were unaffected — compose overrides
the HEALTHCHECK with `--cacert + https://`, Helm uses explicit
`httpGet` probes that ignore Docker's HEALTHCHECK, and every example
compose file overrides with `curl -sfk https://localhost:8443/health`.
But anyone running bare `docker run` / Docker Swarm / Nomad / ECS —
exactly the "I just pulled the published image" path — saw permanent
`unhealthy` status and (depending on orchestrator policy) a restart-
loop. (Audit: cat-u-healthcheck_protocol_mismatch in
coverage-gap-audit-2026-04-24-v5/unified-audit.md.)

Recon for U-2 surfaced two adjacent bugs from the same v2.2 milestone
gap, both bundled into this commit because they share the same root
cause and the same operator surface:

  1. Helm chart `server.readinessProbe.httpGet.path` pointed at
     `/readyz`, the kube-flavored convention. The certctl server
     doesn't register `/readyz` (only `/health` and `/ready` are
     wired and bypass the auth middleware — see
     internal/api/router/router.go:81 and cmd/server/main.go:920).
     K8s readiness probes therefore got 401 (api-key auth rejection)
     or 404 (when auth was disabled), pods stayed `NotReady`
     indefinitely, and Helm rollouts stalled.

  2. The agent image (`Dockerfile.agent`) had no HEALTHCHECK at all,
     so bare-`docker run` agents got zero health signal. The
     compose override at `deploy/docker-compose.yml:173` called
     `pgrep -f certctl-agent` against the agent image, but the
     agent image didn't ship `procps` — pgrep was missing too. The
     compose probe was a latent always-fail.

We fixed all three with the audit-recommended shape (option (a) — `-k`)
plus three structural backstops:

Files changed:

Phase 1 — Dockerfile fix:
- Dockerfile: HEALTHCHECK switched from `curl -f http://localhost:8443/
  health` to `curl -fsk https://localhost:8443/health`. `-k`
  (insecure) is acceptable because the probe is localhost-to-localhost:
  the same process serving the cert is being probed, no network hop.
  Pinning `--cacert` is not viable for the published image because
  the bootstrap cert is per-deploy (generated into the `certs` named
  volume on first up; operator-supplied via Helm's `existingSecret`
  or cert-manager). Long-form docblock cross-references the audit
  closure, the compose vs Helm vs examples coverage matrix, and the
  CI guardrail.
- Dockerfile.agent: added HEALTHCHECK using `pgrep -f certctl-agent`
  matching the compose pattern. Added `procps` to the runtime apk
  install — fixes both the new image-level HEALTHCHECK AND the
  pre-existing compose probe that was silently failing.

Phase 2 — Helm readiness probe path:
- deploy/helm/certctl/values.yaml: server.readinessProbe.httpGet.path
  changed from `/readyz` to `/ready`. Liveness probe path
  (`/health`) was correct and is unchanged. Probes block now carries
  an explanatory comment naming the registered no-auth probe routes
  and the U-2 closure rationale.

Phase 3 — Image-level integration tests:
- deploy/test/healthcheck_test.go (new, //go:build integration):
  TestPublishedServerImage_HealthcheckSpecUsesHTTPS builds the server
  image, inspects `Config.Healthcheck.Test` via `docker inspect`,
  and asserts the array contains `https://localhost:8443/health` and
  `-k`, and does NOT contain `http://localhost:8443/health`
  (positive + negative regression contracts).
  TestPublishedAgentImage_HealthcheckSpecExists builds the agent image
  and asserts the HEALTHCHECK uses `pgrep` against `certctl-agent`.
  Both tests `t.Skip` cleanly when docker isn't available (sandbox /
  CI without docker-in-docker) — verified locally: tests skip with the
  diagnostic and the suite returns PASS.
  TestPublishedServerImage_HealthcheckTransitionsToHealthy is a
  documented `t.Skip` placeholder until the harness wires a sidecar
  postgres for image-level smoke; the spec-level tests above cover the
  audit-flagged regression.

Phase 4 — CI guardrail:
- .github/workflows/ci.yml: new "Forbidden plaintext HEALTHCHECK
  regression guard (U-2)" step. Scoped patterns catch
  `HEALTHCHECK.*http://` and `curl -f http://localhost:8443/health`
  in any `Dockerfile*`. Comment lines exempt; docs/upgrade-to-tls.md
  out of scope (the post-cutover invariant string at line 182 is
  intentionally a documented expected-failure assertion). Verified
  locally on the real tree (passes) and against synthetic regressions
  (each fires the guard).

Phase 5 — Docs sweep:
- docs/connectors.md: 15 stale curl examples updated from
  `http://localhost:8443/...` to `https://localhost:8443/...` with
  `--cacert "$CA"` injected on every site. Added a one-time
  introductory note documenting the `$CA` extraction with
  `docker compose ... exec ... cat /etc/certctl/tls/ca.crt`,
  matching the pattern in docs/quickstart.md. Pre-U-2 these examples
  silently failed against the HTTPS listener.

Phase 6 — Release surface:
- CHANGELOG.md: appended U-2 section to the existing [unreleased]
  block (immediately below the G-1 entry). Sections: explanatory
  blockquote covering all three bugs (primary + 2 adjacent), Fixed,
  Added, Changed.

Verification (all gates pass):
- go build ./... — clean
- go vet ./... — clean
- go vet -tags integration ./deploy/test/ — clean
- go test -short ./... — every package green
- go test -tags integration -v -run TestPublishedServerImage|TestPublishedAgentImage ./deploy/test/ —
  three tests SKIP cleanly with "docker not available" diagnostic
- helm lint deploy/helm/certctl/ — clean
- helm template smoke render — succeeds; rendered Deployment carries
  `path: /ready` and zero `/readyz` matches
- python3 yaml.safe_load on api/openapi.yaml — parses
- govulncheck ./... — no vulnerabilities in our code
- CI guardrail mirror: clean on real tree, fires on synthetic
  regression patterns

Out of scope (intentionally untouched):
- cmd/server/main.go::ListenAndServeTLS — HTTPS-only is correct,
  this finding does NOT propose adding back a plaintext listener.
- deploy/docker-compose.yml:126 HEALTHCHECK — already correct.
- deploy/docker-compose.test.yml HEALTHCHECK blocks — already correct.
- All 5 examples/*/docker-compose.yml HEALTHCHECK overrides — already
  correct (they ALSO use `-fsk https://localhost:8443/health`).
- Helm server.livenessProbe.httpGet — already uses `scheme: HTTPS` +
  `path: /health`, correct.
- docs/upgrade-to-tls.md:182 `curl ... http://localhost:8443/health`
  invariant line — that's the expected-failure assertion for the
  post-cutover state ("plaintext is gone, expect Connection refused");
  intentionally left intact.
- Go production code — this is purely a deploy-image / probe / docs /
  Helm-chart fix.

Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
      §2 P1 cluster, cat-u-healthcheck_protocol_mismatch
      Audit recommendation followed verbatim: 'change Dockerfile:80
      to CMD curl -kf https://localhost:8443/health'.
2026-04-25 12:02:18 +00:00
shankar0123 9c1d446e40 fix(security,config): remove unimplemented JWT auth-type, close silent downgrade (G-1)
The pre-G-1 config validator accepted CERTCTL_AUTH_TYPE=jwt and the
startup log faithfully echoed 'authentication enabled type=jwt'.
Reasonable people read that and concluded JWT auth was on. It wasn't.
The auth-middleware wiring at cmd/server/main.go unconditionally routed
every request through the api-key bearer middleware regardless of
cfg.Auth.Type. So CERTCTL_AUTH_TYPE=jwt quietly compared the incoming
'Authorization: Bearer <token>' against whatever string the operator put
in CERTCTL_AUTH_SECRET — real JWT clients got 401, and operators who
treated CERTCTL_AUTH_SECRET as a *signing* secret (because they thought
they were configuring JWT) had effectively handed an attacker an api-key.
A security finding masquerading as a config option.

We chose the audit-recommended structural fix: remove the option, fail
fast at startup, and add the gateway-fronting pattern as the documented
forward path. Implementing JWT middleware would have meant jwks vs
static-secret rotation, claim mapping, expiry enforcement, audience and
issuer validation, key rollover semantics, and regression coverage at the
same depth as the existing api-key path — a feature, not a fix. Operators
who genuinely need JWT/OIDC front certctl with an authenticating gateway
(oauth2-proxy / Envoy ext_authz / Traefik ForwardAuth / Pomerium /
Authelia) and run the upstream certctl with CERTCTL_AUTH_TYPE=none. Same
shape works on docker-compose and Helm.

The change is comprehensive across 7 phases — every surface that
mentioned 'jwt' as a certctl-auth-type is updated, plus structural
backstops (typed enum, runtime guard, helm template validation, CI grep
guard) so the lie can't reappear.

Files changed:

Phase 1 — production code (typed enum + jwt removal):
- internal/config/config.go: AuthType typed alias + AuthTypeAPIKey /
  AuthTypeNone constants + ValidAuthTypes() helper. Validate() routes
  literal 'jwt' through a dedicated multi-line diagnostic naming the
  authenticating-gateway pattern, then cross-checks against
  ValidAuthTypes(). Secret-required branch simplified to api-key-only.
  Field comment on AuthConfig.Type rewritten to drop jwt and point at
  the gateway pattern.
- internal/api/middleware/middleware.go: AuthConfig.Type field comment
  references the typed config.AuthType constants.
- internal/api/handler/health.go: same treatment for HealthHandler.AuthType.
- cmd/server/main.go: defense-in-depth runtime switch immediately after
  config.Load() — exits 1 on any unsupported auth-type that bypassed the
  validator. Auth-disabled startup log explicitly names the
  authenticating-gateway pattern.

Phase 2 — tests (Red→Green, contract pinning):
- internal/config/config_test.go: TestValidate_JWTAuth_RejectedDedicated
  (two table rows pinning the dedicated G-1 error fires regardless of
  whether Secret is set), TestValidAuthTypesDoesNotContainJWT (property
  guard against future re-introduction),
  TestValidAuthTypesIsExactly_APIKey_None (allowed-set contract),
  TestValidate_GenericInvalidAuthType (pins non-jwt invalid values still
  hit the generic invalid-auth-type error). Removed the prior
  TestValidate_JWTAuth_MissingSecret happy-path since its premise is
  inverted post-G-1.
- internal/api/handler/health_test.go: removed
  TestAuthInfo_ReturnsAuthType_JWT (which baked the silent-downgrade lie
  into the regression suite). Pre-existing _APIKey test continues to
  cover the api-key happy path.

Phase 3 — spec, docs, env templates:
- api/openapi.yaml: auth_type enum dropped to [api-key, none] with
  inline comment naming the G-1 closure.
- .env.example (root): CERTCTL_AUTH_TYPE comment block rewritten to drop
  jwt and point at the gateway pattern; secret-required conditional
  simplified to api-key-only.
- docs/architecture.md: middleware-stack bullet rewritten to drop the
  JWT mention; new H3 'Authenticating-gateway pattern (JWT, OIDC, mTLS)'
  section explaining the design rationale and listing oauth2-proxy /
  Envoy ext_authz / Traefik ForwardAuth / Pomerium / Authelia / Caddy
  forward_auth / Apache mod_auth_openidc / nginx auth_request as the
  standard fronting options.
- docs/upgrade-to-v2-jwt-removal.md (new ~125 lines): migration guide
  with preconditions, what-changes, both recovery paths, complete
  docker-compose oauth2-proxy walkthrough, Traefik ForwardAuth and Envoy
  ext_authz patterns, rollback posture.

Phase 4 — Helm chart (template validation + docs):
- deploy/helm/certctl/templates/_helpers.tpl: new certctl.validateAuthType
  helper mirroring the existing certctl.tls.required pattern. Fails
  template render on any server.auth.type outside {api-key, none} with
  a multi-line diagnostic.
- deploy/helm/certctl/templates/server-deployment.yaml,
  server-configmap.yaml, server-secret.yaml: invoke the helper at the
  top of each template that depends on .Values.server.auth.type.
- deploy/helm/certctl/values.yaml: auth: block comment expanded with the
  G-1 rationale and gateway-pattern cross-reference.
- deploy/helm/CHART_SUMMARY.md: server.auth.type table row now surfaces
  the allowed set and points at the upgrade doc.
- deploy/helm/certctl/README.md: new 'JWT / OIDC via authenticating
  gateway' section with a Kubernetes-flavored oauth2-proxy + certctl
  walkthrough.

Phase 5 — release surface:
- CHANGELOG.md: new [unreleased] top entry with Breaking / Removed /
  Added / Changed sections; explicit pointer at
  docs/upgrade-to-v2-jwt-removal.md from the Breaking subsection.

Phase 6 — CI guardrail:
- .github/workflows/ci.yml: new 'Forbidden auth-type literal regression
  guard (G-1)' step. Scoped patterns catch the actual regression shapes
  (map literal, slice literal, switch case, OpenAPI enum, env-file
  default, AuthType('jwt') cast). Comments and the dedicated rejection
  branch are intentionally exempt; connector-package JWT references
  (Google OAuth2 / step-ca) are exempt as out-of-scope external
  protocols. Verified locally: the guard passes on the actual tree and
  fires on all 4 synthetic regression patterns.

Out of scope (explicitly untouched):
- internal/connector/discovery/gcpsm/gcpsm.go — Google OAuth2 service-
  account JWT (external protocol).
- internal/connector/issuer/googlecas/googlecas.go — same.
- internal/connector/issuer/stepca/stepca.go — step-ca's provisioner
  one-time-token JWT for /sign API.
- docs/test-env.md, docs/connectors.md, docs/features.md — describe
  external CAs' use of JWT, not certctl's auth shape.
- Implementing actual JWT middleware. Feature, not a fix.

Verification (all gates pass):
- go build ./... — clean
- go vet ./... — clean
- go test -short ./... — every package green
- go test -short -race ./internal/config/... ./internal/api/... — clean
- govulncheck ./... — no vulnerabilities in our code
- helm lint deploy/helm/certctl/ — clean
- helm template with auth.type=api-key — renders OK
- helm template with auth.type=none — renders OK
- helm template with auth.type=jwt — fails with validateAuthType
  diagnostic (exit 1)
- python3 yaml.safe_load on api/openapi.yaml — parses
- CI guardrail mirror — clean on real tree, fires on all 4 synthetic
  regression patterns
- Smoke test: 'CERTCTL_AUTH_TYPE=jwt ./certctl-server' exits non-zero
  with: 'Failed to load configuration: CERTCTL_AUTH_TYPE=jwt is no
  longer accepted (G-1 silent auth downgrade): no JWT middleware ships
  with certctl. To use JWT/OIDC, run an authenticating gateway
  (oauth2-proxy / Envoy ext_authz / Traefik ForwardAuth / Pomerium) in
  front of certctl and set CERTCTL_AUTH_TYPE=none on the upstream.
  See docs/architecture.md "Authenticating-gateway pattern" and
  docs/upgrade-to-v2-jwt-removal.md for the migration walkthrough'

config pkg coverage: ValidAuthTypes 100%, Validate 94.7%, total 75.5%.

Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
      §2 P1 cluster, cat-g-jwt_silent_auth_downgrade
      Audit recommendation followed verbatim: 'Remove jwt from
      validAuthTypes until middleware ships'.
2026-04-25 00:22:23 +00:00
shankar0123 af47d19ae2 fix(deploy,examples,env): close U-1 trap end-to-end across Helm, examples, and root env
Follow-up to cfc234e (U-1 docker-compose fix) — closes the remaining adjacent
code paths that share the postgres-first-boot-password-binding root cause but
were scoped out of the original commit.

The runtime diagnostic in internal/repository/postgres/db.go::wrapPingError
(landed in a911970) already covers every NewDB call site, so Helm operators
and example users hit the SQLSTATE 28P01 guidance for free at startup. What
was missing: deployment-shape-specific remediation guidance (kubectl vs
docker-compose), the hardcoded password in the *root* .env.example, and
shared ops notes for the 5 examples/ compose files. This commit closes all
three.

Files changed:

- .env.example (root) — line 16 had `postgres://certctl:certctl@...` with
  the password hardcoded literally instead of interpolating POSTGRES_PASSWORD.
  Edit if a user copied this file as their .env (binary-direct deployment,
  not docker-compose) and rotated POSTGRES_PASSWORD on line 10, the URL on
  line 16 still carried 'certctl' — silent two-line drift. Replaced 'certctl'
  with the same default that line 10 carries ('change-me-in-production') and
  added an explanatory comment block describing the docker-compose
  override semantics, when this URL matters (binary-direct), and the
  cross-reference to the U-1 wrapPingError diagnostic. Also fixed an
  adjacent bug: line 31 CERTCTL_SERVER_URL was `http://localhost:8443`,
  which agents reject at startup since v2.2 (HTTPS-everywhere milestone made
  the control plane HTTPS-only with TLS 1.3 pinned). Updated to https://
  with a comment pointing operators at the bootstrap CA bundle.

- deploy/helm/certctl/values.yaml — postgresql.auth.password field had a
  one-line 'REQUIRED' comment. Expanded into a full WARNING block (~25
  lines) explaining the PVC retention semantics, the failure symptom,
  and both kubectl-flavored remediation paths: non-destructive
  (`kubectl exec ... ALTER ROLE`) preferred for environments with data,
  and destructive (`helm uninstall + kubectl delete pvc`) for dev/demo.
  Cross-references the wrapPingError runtime diagnostic.

- deploy/helm/certctl/README.md (new, ~115 lines) — chart-level operational
  guide. Covers quick install, both remediation paths with concrete
  kubectl commands, why-we-don't-fix-this-in-the-chart explanation,
  cross-references to the docker-compose docs, server API key rotation
  (the easy case — comma-separated key list), TLS provisioning shapes,
  embedded-vs-external postgres, and uninstall semantics with the PVC
  retention gotcha called out.

- examples/README.md (new, ~55 lines) — shared operational notes for the
  5 example deployments. Covers the postgres password rotation trap with
  example-flavored remediation paths (`docker compose -f examples/<x>/...`),
  the TLS warning, and teardown semantics. Replaces what would otherwise
  be 5x duplication across per-example READMEs.

- examples/{acme-nginx,acme-wildcard-dns01,multi-issuer,private-ca-traefik,
  step-ca-haproxy}/*.md — one-line cross-reference at the top of each
  example's primary doc, pointing at examples/README.md for the shared
  ops notes. Avoids 5x duplication of the same warning text while still
  surfacing the link in every operator's first-touch surface.

Verification:

- go build ./... — clean
- go vet ./... — clean
- go test -short ./internal/repository/postgres/ — 4/4 wrapPingError tests
  still passing (no production-code touch in this commit)
- helm lint deploy/helm/certctl/ — clean (1 INFO about chart icon, pre-existing)
- helm template smoke test — renders without error
- python3 yaml.safe_load on values.yaml — parses

Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
      §2 P1 cluster, cat-u-quickstart_postgres_password_volume_trap
      Closes the three deliberate scope-outs from cfc234e (Helm,
      root .env.example, examples/) end-to-end.

      Adjacent bugs caught while in scope:
      - root .env.example:16 hardcoded password not matching line 10
      - root .env.example:31 http:// URL incompatible with HTTPS-only v2.2
2026-04-24 23:51:13 +00:00
shankar0123 52248be717 v2.0.47: HTTPS Everywhere — TLS-only control plane, agents/CLI/MCP
Breaking change release. Plaintext HTTP listener removed. The certctl
control plane now terminates TLS 1.3 on :8443 via
http.Server.ListenAndServeTLS. No CERTCTL_TLS_ENABLED=false escape
hatch. No dual-listener mode. One-step cutover per docs/upgrade-to-tls.md.

Server
- cmd/server/tls.go: certHolder with SIGHUP hot-reload + atomic cert
  swap, buildServerTLSConfig (TLS 1.3 min, GetCertificate callback),
  preflightServerTLS validation
- cmd/server/main.go: ListenAndServeTLS in place of ListenAndServe,
  watchSIGHUP wiring, cert/key path config threading
- tls_test.go: 418-line regression coverage of reload, preflight,
  callback behavior, SAN validation

Config
- CERTCTL_TLS_CERT_PATH / CERTCTL_TLS_KEY_PATH (required)
- Plaintext rejection: agents/CLI/MCP pre-flight-fail on http://
  URLs with a pointer to docs/upgrade-to-tls.md

Agents, CLI, MCP
- All three pre-flight-reject http:// URLs with fail-loud diagnostic
- CERTCTL_SERVER_CA_BUNDLE_PATH for private-CA trust
- CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY for dev-only bypass
  (loud warning on startup)
- install-agent.sh emits both vars as commented template lines

docker-compose
- certctl-tls-init sidecar generates SAN-valid self-signed cert into
  deploy/test/certs/ on first boot
- All demo-stack curls pin against ca.crt with --cacert

Helm chart
- Three TLS provisioning modes, exactly one required:
  - server.tls.existingSecret (operator-supplied)
  - server.tls.certManager.enabled (cert-manager integration)
  - server.tls.selfSigned.enabled (eval only — not for production)
- server-certificate.yaml template for cert-manager mode
- helm install without a TLS source fails at template render with
  a pointer to docs/tls.md

CI
- .github/workflows/ci.yml Helm Chart Validation step renders the
  chart in both existingSecret and cert-manager modes, plus an
  inverse guard-regression test that asserts helm template MUST
  refuse to render when no TLS source is configured. Previously
  the single `helm template` invocation hit the certctl.tls.required
  fail-loud guard and exit-1'd CI. Four invocations now: lint
  (existingSecret), template (existingSecret), template
  (cert-manager), template (no args — must fail).

Integration tests
- deploy/test/integration_test.go stands up the Compose stack over
  HTTPS, extracts the CA bundle, and exercises every certctl API
  over https://localhost:8443
- All 34 integration subtests green (per Phase 8 local CI-parity)

Documentation
- New: docs/tls.md (provisioning patterns, rotation, SIGHUP reload)
- New: docs/upgrade-to-tls.md (one-step cutover, no-downgrade
  warnings, fleet-roll sequencing)
- CHANGELOG.md: v2.2.0 "HTTPS Everywhere — The Irony" entry
  (file heading unchanged; release tag is v2.0.47)
- All curls in docs/, examples/, deploy/helm/ guides use
  https://localhost:8443 --cacert

Verification
- grep -rn "ListenAndServe[^T]" cmd/ internal/ → 0 hits
- grep -rn "\"http://" cmd/ internal/ → 2 benign hits (Caddy admin
  API default, SSRF doc comment) — zero certctl endpoints
- Tasks #197–#206 (Phases 0–8) all closed in the tracker

Files: 65 changed, 3489 insertions, 372 deletions (pre-CI-fix).
2026-04-20 03:43:10 +00:00
shankar0123 a49eae8155 fix: correct BSL 1.1 change date to March 14, 2033
why-certctl.md said March 1, CHART_SUMMARY.md said March 28. The
LICENSE file is authoritative: Change Date is March 14, 2033.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 11:12:49 -04:00
shankar0123 5567d4b411 feat(M47): add Kubernetes Secrets target + AWS ACM PCA issuer connectors
Implement both M47 connectors with full cross-layer wiring:

Kubernetes Secrets target: DNS-1123 validation, kubernetes.io/tls Secret
create-or-update, chain concatenation, serial number validation, Helm
RBAC gating. 18 tests.

AWS ACM Private CA issuer: synchronous issuance (like Vault), ARN regex
validation, RFC 5280 revocation reason mapping, CA cert retrieval,
factory + env var seeding. 23 tests.

Cross-cutting: domain types, service validation, config, factory, agent
dispatch, frontend (TargetsPage, issuerTypes), OpenAPI, seed data, Helm
chart, connectors docs, README. Testing docs (testing-guide, qa-test-guide,
qa_test.go) with Parts thematically integrated near related connectors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-07 20:21:09 -04:00
shankar0123 397d2a1588 fix(helm): remove fail on empty postgresql password for lint/template
Default to "changeme" so helm lint and helm template pass with stock
values. Operators override at install time via --set.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-28 21:30:13 -04:00
shankar0123 65567d0d83 fix(helm): type comparison error and lint-time fail on empty apiKey
- Use gt (int .Values.server.replicas) 1 to avoid incompatible type
  comparison between YAML integer and template literal
- Remove fail directive for empty apiKey — lint runs with defaults,
  operators set the key via --set at install time

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-28 21:28:05 -04:00
shankar0123 0abd984285 fix: staticcheck S1016 struct conversion + Helm with/else-if parse error
- Use type conversion DigestStatusCount(c) instead of struct literal
- Replace with...else-if (invalid in Go templates) with if...else-if chain
- Add *.bak and cmd/agent/*.key/*.pem to .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-28 21:25:25 -04:00
shankar0123 ec21c9bb29 feat(m28+m29+m30): ACME ARI, email digest, and Helm chart
M28: ACME Renewal Information (RFC 9702) — CA-directed renewal timing
with cert ID computation, directory endpoint discovery, graceful
degradation for non-ARI CAs. 19 tests.

M29: Email notifier wiring + scheduled certificate digest — SMTP
connector bridged to service layer via NotifierAdapter, DigestService
with HTML email template, 7th scheduler loop (24h), digest preview/send
API endpoints and GUI card. 21 tests.

M30: Production-ready Helm chart — server Deployment, PostgreSQL
StatefulSet, agent DaemonSet, ConfigMaps, Secrets, Ingress, security
contexts, health probes, example values for dev/prod/ACME scenarios.

Also: OpenAPI spec updates, MCP tool additions, CI helm-lint job,
documentation updates across 5 doc files and README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-28 21:18:35 -04:00