Commit Graph

382 Commits

Author SHA1 Message Date
shankar0123 0e29c416b1 refactor(handler,repo): replace strings.Contains error dispatch with typed sentinels (S-2)
Closes one 2026-04-24 audit finding (P2):

  - cat-s6-efc7f6f6bd50: 30 strings.Contains(err.Error(), ...) sites
    in internal/api/handler/ — brittle to repository-layer message
    changes, untyped against the actual failure mode.

Approach (Option B from prompt design notes):
  - New typed sentinels in internal/repository/errors.go:
      ErrNotFound, ErrForeignKeyConstraint
      IsForeignKeyError(err) helper (the only place substring
      matching at the lib/pq boundary is allowed; isolates the
      DB-driver string knowledge to one function).
  - New typed sentinel in internal/domain/errors.go:
      ErrValidation (reserved for future per-entity validation
      wrappers; not yet used by all handlers).
  - 49 sites in internal/repository/postgres/*.go updated to wrap
    sql.ErrNoRows-derived errors via fmt.Errorf("...: %w",
    repository.ErrNotFound).
  - 18 not-found handler sites + 2 FK-constraint handler sites
    refactored to errors.Is(err, repository.ErrNotFound) /
    repository.IsForeignKeyError(err).
  - 23 inline `fmt.Errorf("X not found")` test fixtures across
    handler tests rewrapped to wrap repository.ErrNotFound.
  - test_utils.go::ErrMockNotFound rewrapped to wrap
    repository.ErrNotFound; renewal_policy.go closure docblock
    updated to reflect the new convention.
  - integration test mockJobRepository.Get wraps repository.ErrNotFound.

CI regression guardrail:
- .github/workflows/ci.yml::"Forbidden strings.Contains(err.Error())
  regression guard (S-2)" greps for the three patterns ("not found",
  "violates foreign key", "RESTRICT") under internal/api/handler/
  and fails the build on regression.

Verification:
- go build ./... — clean
- go vet ./... — clean
- go test ./... -short -count=1 — all packages pass (handler +
  repository + service + integration)
- golangci-lint v2.11.4 run ./... — 0 issues
- S-2 guardrail dry-run on post-fix tree → empty (good)
- All sibling guardrails (S-1, G-3, D-1+D-2, B-1, L-1, H-1, C-1, F-1, P-1) pass

Audit findings closed:
- cat-s6-efc7f6f6bd50 (P2)

Deferred follow-ups:
- 6 domain-specific substring patterns still inline in handlers
  ("cannot approve", "cannot reject", "cannot be parsed",
  "no certificates found", "challenge password", "invalid"/
  "required" validation chains in profiles + agent_groups). Each
  needs its own typed sentinel, scoped per service. Documented
  by the S-2 CI guardrail's allowlist for closure-comments only.
- Per-entity not-found sentinels (Option A — ErrCertificateNotFound,
  ErrAgentNotFound, etc.) deferred. Generic ErrNotFound covers the
  current dispatch needs; per-entity precision would let handlers
  return entity-aware error bodies without a domain.Type field,
  but not blocking.
2026-04-25 17:54:14 +00:00
shankar0123 8a3086c4ae Merge branch 'fix/p1-master-orphan-client-fn-sweep' (P-1 master, 2 audit findings) 2026-04-25 17:41:12 +00:00
shankar0123 d4c421b98d chore(web,ci): document orphan client fns + sync guard (P-1 master)
Closes two 2026-04-24 audit findings:

  - diff-04x03-d24864996ad4 (P2, "26 orphan client fns")
  - cat-b-dc46aadab98e   (P3, "16 singleton-getter orphans")

Recon at HEAD found 17 actual orphans (not 26 or 16 — the audit
numbers conflated; many were eliminated by the B-1 / S-1 / I-2 /
D-2 closures since the audit was written, and the audit's regex
double-counted in some buckets). All 17 are detail-page candidates:
singleton-getter `getX(id)` fns that detail pages will need when
the corresponding `XPage` grows a `XDetailPage` route. Two valid
closures:
  - delete each fn (forces re-add when detail pages land)
  - document each as intent-suspect-but-preserved (lets future
    detail-page work land without a client.ts edit detour)

Picked the document-and-preserve path. Reasons:
  - Many of the 17 are obvious detail-page candidates (Owner,
    Team, AgentGroup, Policy, RenewalPolicy, Notification,
    AuditEvent, NetworkScanTarget, HealthCheck, DiscoveredCertificate)
    given the existing list-page + Edit-modal pattern shipped in B-1.
  - The cost of the deletes (and re-adds, and test re-adds) outweighs
    the cost of carrying 17 documented-orphan declarations.
  - registerAgent (already covered by C-1's docblock as by-design
    pull-only) sits in this same set and is the canonical "preserved
    orphan" precedent.

Changes:
- web/src/api/client.ts: new docblock at file-top listing all 17
  documented orphans with their detail-page rationale and a
  pointer to the CI guardrail.
- .github/workflows/ci.yml: new step "Documented orphan client fns
  sync guard (P-1)" verifies that every name in the docblock is
  still declared as `export const X = ...` somewhere in client.ts.
  Catches drift in either direction (delete export but forget
  docblock = MISSING; delete docblock entry but leave export =
  silent orphan accumulation, caught only on next mass-recon).

Verification:
- P-1 guardrail dry-run on post-fix tree → MISSING='' (empty, good)
- tsc --noEmit — clean
- golangci-lint v2.11.4 run ./... — 0 issues
- All sibling guardrails (S-1, G-3, D-1+D-2, B-1, L-1, H-1, C-1, F-1) pass

Audit findings closed:
- diff-04x03-d24864996ad4 (P2)
- cat-b-dc46aadab98e (P3)

Deferred follow-ups:
- The 17 detail-page candidates remain orphan until a XDetailPage
  consumer lands. Each future detail-page commit removes one entry
  from the docblock as it gains a real consumer. The CI guardrail
  enforces the docblock-↔-export sync regardless.
2026-04-25 17:41:12 +00:00
shankar0123 1bdab897ef Merge branch 'fix/f1-master-certificates-page-ux' (F-1 master, 2 audit findings) 2026-04-25 17:38:54 +00:00
shankar0123 94ca69554b feat(web): expand CertificatesPage filters + reusable DataTable pagination (F-1 master)
Closes two 2026-04-24 audit findings (P2):

  - cat-e-610251c8f72d: CertificatesPage exposed only 5 of the
    backend handler's 17 supported query filters. Audit recommended
    minimum-add: team_id (already first-class elsewhere),
    expires_before (drives the "expiring in N days" workflow), and
    sort (sort by notAfter for the most common operator triage).
    Fix: 3 new useState hooks + 3 new filter UIs in the toolbar +
    3 new param wires. Remaining filters (agent_id, expires_after,
    created_after, updated_after, cursor, fields, sort_desc) deferred
    until a consumer use case demands them — over-stuffing the
    toolbar is its own UX cost.

  - cat-k-e85d1099b2d7: CertificatesPage rendered the first 50
    certs returned by the backend with no way to advance. Backend
    response carries {data, total, page, per_page} — a pure render
    gap. Fix: lifted pagination into the reusable DataTable
    component as an opt-in `pagination?` prop. CertificatesPage is
    the first consumer; TargetsPage / IssuersPage / OwnersPage /
    others can adopt by passing the same prop.

DataTable changes:
- New `PaginationProps` interface (page, perPage, total,
  onPageChange, onPerPageChange?, perPageOptions?).
- New optional `pagination?` prop on DataTable.
- New `PaginationControls` subcomponent rendered in the table
  footer when `pagination` is set and `total > 0`. Renders
  "Showing X–Y of Z" + per-page selector + page counter +
  Prev/Next buttons. Disabling logic guards both boundaries.

CertificatesPage changes:
- 3 new filter useState hooks: teamFilter, expiresBefore, sortBy.
- 2 new pagination useState hooks: page (1), perPage (50).
- Added 4th cohort hook: getTeams via useQuery (mirrors the
  existing issuers/owners/profiles filter-data pattern).
- params object gains team_id, expires_before, sort, page, per_page.
- 3 new filter UIs in the toolbar (team select, expires_before
  date picker, sort select).
- DataTable gets the new pagination prop.
- Filter changes reset page=1 to keep results visible.

Verification:
- tsc --noEmit — clean
- vitest run — 9 files, 302 tests passing (no regression)
- golangci-lint v2.11.4 run ./... — 0 issues
- All sibling guardrails (S-1, G-3, D-1+D-2, B-1, L-1, H-1, C-1) pass

Audit findings closed:
- cat-e-610251c8f72d (P2)
- cat-k-e85d1099b2d7 (P2)

Deferred follow-ups:
- 8 backend filters (agent_id, expires_after, created_after,
  updated_after, cursor, fields, sort_desc, plus secondary sort
  fields) deferred until consumer demand justifies UI weight.
- TargetsPage / IssuersPage / OwnersPage / etc. opt-in to the
  pagination prop incrementally — DataTable now supports it; per-
  page adoption is a follow-up commit each.
- CertificatesPage Vitest coverage of the new filter+pagination
  paths deferred to the per-page test campaign (cat-s2-c24a548076c6).
2026-04-25 17:38:54 +00:00
shankar0123 c4d231e728 Merge branch 'fix/c1-master-cleanup-and-doc-tail' (C-1 master, 6 audit findings) 2026-04-25 17:34:59 +00:00
shankar0123 1c6009a920 chore(cleanup,docs): vite proxy + dead scheduler setter wired + registerAgent/CLI docs (C-1 master)
Closes six 2026-04-24 audit findings (3 P2 + 3 P3) — a cleanup-and-doc
tail bundle that drains the smallest remaining leaves of the audit:

  - cat-u-vite_dev_proxy_plaintext_drift (P2): web/vite.config.ts
    proxied dev requests to http://localhost:8443 against an HTTPS-only
    backend (HTTPS-only since v2.0.47). Every dev-server API call 502'd.
    Fix: targets are now object-form `{target: 'https://...', secure: false,
    changeOrigin: true}` — the dev cert is self-signed by the
    deploy/test bootstrap and changes per-checkout.

  - cat-g-7e38f9708e20 (P3): Scheduler.SetShortLivedExpiryCheckInterval
    was defined + tested but never called from cmd/server/main.go.
    Operators tuning CERTCTL_SHORT_LIVED_EXPIRY_CHECK_INTERVAL got
    no effect — the 30s default in scheduler.NewScheduler was
    effectively hardcoded. Fix: added Config.Scheduler.ShortLivedExpiryCheckInterval
    + getEnvDuration in Load() reading the env var with a 30s default,
    + sched.SetShortLivedExpiryCheckInterval(...) call in main.go
    alongside the other scheduler-interval setters.

  - diff-10xmain-2bf4a0a60388 (P3): same root cause as cat-g-7e38f9708e20;
    closes as ride-along.

  - cat-b-6177f36636fb (P2): registerAgent client fn orphan. By-design
    per pull-only deployment model. Fix (audit recommendation:
    "document"): added a closure docblock above the export in
    client.ts + a new "Registration is by-design pull-only" paragraph
    in docs/architecture.md::Agents section explaining when/why a
    future GUI-driven enrollment feature might reach the endpoint
    (proxy-agent topologies for network appliances).

  - cat-i-7c8b28936e3d (P2): CLI scope intentionally narrow but
    undocumented. Fix: new "Scope (intentionally narrow)" subsection
    in docs/features.md::CLI capturing the SSH-into-prod / day-to-day
    GUI / AI-automation MCP three-way split.

Verification:
- go build ./... — clean
- go vet ./... — clean
- go test ./internal/scheduler/... ./internal/config/... — pass
- golangci-lint v2.11.4 run ./... — 0 issues
- tsc --noEmit (frontend) — clean
- All sibling guardrails (S-1 / G-3 / D-1+D-2 / B-1 / L-1 / H-1) still pass

Audit findings closed:
- cat-u-vite_dev_proxy_plaintext_drift (P2)
- cat-g-7e38f9708e20 (P3)
- diff-10xmain-2bf4a0a60388 (P3)
- cat-b-6177f36636fb (P2)
- cat-i-7c8b28936e3d (P2)
- (audit-bookkeeping ride-along: ensures every closed-bundle row has a non-empty merge SHA)

Deferred follow-ups: none from this bundle. The remaining audit
backlog (frontend test campaign, F-1 CertificatesPage UX, P-1
orphan-fn sweep, S-2 handler error-mapping refactor) is sibling
sub-bundles in this mega-prompt.
2026-04-25 17:34:59 +00:00
shankar0123 a39f5af22a Merge branch 'fix/h1-master-security-hardening-trio' (H-1 master, 3 audit findings) 2026-04-25 16:40:22 +00:00
shankar0123 3e78ecb799 feat(security): bodyLimit on noAuth + security headers + encryption-key validation (H-1 master)
Closes three 2026-04-24 audit findings (all P2):
  - cat-s5-4936a1cf0118: noAuthHandler chain accepted arbitrary-size
    bodies (EST simpleenroll, SCEP, PKI CRL/OCSP, /health, /ready).
    Memory exhaustion vector without HTTP-layer auth gatekeeping.
  - cat-s11-missing_security_headers: zero security headers on any
    response. Clickjacking, MIME-sniffing, untrusted-origin resource
    loads against the dashboard and API.
  - cat-r-encryption_key_no_length_validation: CERTCTL_CONFIG_ENCRYPTION_KEY
    accepted with any non-empty value including a single character.
    PBKDF2-SHA256 (100k rounds) does not compensate for low-entropy
    passphrases at scale (CWE-916, CWE-329).

Changes:
- cmd/server/main.go::noAuthHandler chain — added bodyLimitMiddleware
  + securityHeadersMiddleware. Same default cap as authed surface
  (1MB via CERTCTL_MAX_BODY_SIZE), same 413 on overflow.
- cmd/server/main.go::middlewareStack (authed) — added
  securityHeadersMiddleware before corsMiddleware.
- internal/api/middleware/securityheaders.go (new) — SecurityHeaders
  middleware + SecurityHeadersDefaults() with conservative defaults:
  HSTS 1y+includeSubDomains, X-Frame-Options DENY, X-Content-Type-
  Options nosniff, Referrer-Policy no-referrer-when-downgrade, CSP
  default-src 'self' + img/data + style 'unsafe-inline' (Tailwind/Vite
  needs it; scripts still 'self' only) + connect 'self' + frame-
  ancestors 'none'. Operators behind a customising reverse proxy can
  disable any header by setting its config field to empty.
- internal/config/config.go::Validate() — enforce minEncryptionKeyLength
  = 32 bytes when CERTCTL_CONFIG_ENCRYPTION_KEY is set. Empty stays
  accepted (downstream fail-closed sentinel handles it). Structured
  error names the env var, the actual length, the required minimum,
  and the canonical generation command (`openssl rand -base64 32`).

Tests:
- internal/api/middleware/securityheaders_test.go (new) — 4 cases
  (defaults present, empty value disables single header, override
  applied, headers on 4xx/5xx).
- internal/config/config_test.go — 5 new cases for the encryption-key
  length check (empty accepted, 1-byte rejected, 31-byte rejected at
  boundary, 32-byte accepted, 44-byte realistic operator key accepted).

Documentation:
- CHANGELOG.md — H-1 section above D-2 under [unreleased] with
  Breaking-change callout (operators with low-entropy keys must rotate
  before upgrade).
- coverage-gap-audit-2026-04-24-v5/unified-audit.md — Live Tracker
  25/47 → 33/47, P1 14/14 (zero remaining), P2 11/27 → 16/27. Three
  H-1 findings flipped + closed-bundle row added.

Verification:
- go build ./... — clean
- go vet ./... — clean
- golangci-lint v2.11.4 run ./... — 0 issues
- go test ./internal/api/middleware/... — pass (incl. 4 new
  SecurityHeaders cases)
- go test ./internal/config/... — pass (incl. 5 new EncryptionKey
  cases)
- tsc --noEmit (frontend) — clean
- All sibling guardrails (S-1 / G-3 / D-1 / D-2 / B-1 / L-1) still pass

Audit findings closed:
- cat-s5-4936a1cf0118 (P2)
- cat-s11-missing_security_headers (P2)
- cat-r-encryption_key_no_length_validation (P2)

Breaking change:
- Operators with CERTCTL_CONFIG_ENCRYPTION_KEY shorter than 32 bytes
  must rotate before upgrade. Generate via `openssl rand -base64 32`.

Deferred follow-ups:
- Weak-key dictionary check (reject password123, common ASCII patterns)
  — adds operational friction with low marginal entropy gain at the
  32-byte minimum.
- CSP 'unsafe-inline' for styles — required for Tailwind/Vite
  per-component <style> blocks; removing requires HTML report or
  component refactor outside H-1 scope.
- Permissions-Policy header — dashboard uses no advanced browser APIs
  (camera, mic, geolocation); deferred until a real consumer needs it.
2026-04-25 16:40:21 +00:00
shankar0123 24f25353f8 Merge branch 'fix/i2-mcp-discovered-cert-completeness' (I-2 closure, last P1) 2026-04-25 16:33:56 +00:00
shankar0123 25c34ace45 feat(mcp): add claim_discovered + dismiss_discovered MCP tools (I-2 closure)
Closes the LAST P1 in the 2026-04-24 audit (cat-i-b0924b6675f8). Pre-I-2
the README claimed "all API endpoints are exposed via MCP" but the
discovered-certificate lifecycle (HTTP handlers ClaimDiscovered +
DismissDiscovered at internal/api/handler/discovery.go:125,162) had
zero MCP tool wrappers — operators using Claude / Cursor / similar
MCP clients had no path to bring an out-of-band cert under management
or to mark a benign discovery as not-of-interest without dropping to
the REST API directly. The audit's count of 0 MCP discovery tools
was correct: `grep -niE 'discover|claim|dismiss' internal/mcp/tools.go`
returned only the pre-existing agent-retire tool's description text
mentioning sentinel discovery agents — no actual discovery-tool
registrations.

Added in internal/mcp/types.go:
- ClaimDiscoveredCertificateInput (id + managed_certificate_id)
- DismissDiscoveredCertificateInput (id)

Both follow the existing Go-doc / staticcheck convention (lead with
the type name + brief; closure-rationale prose follows). Pinned by
the existing L-1 staticcheck-fix lesson.

Added in internal/mcp/tools.go (slotted at end of file, after
certctl_auth_check):
- certctl_claim_discovered_certificate — POST /api/v1/discovered-certificates/{id}/claim
- certctl_dismiss_discovered_certificate — POST /api/v1/discovered-certificates/{id}/dismiss

Both wrap the existing HTTP handlers via the generic c.Post helper.
No backend changes; no openapi.yaml changes (both ops were already
in the spec from earlier work).

The audit's third name "acknowledge" is NOT closed: at recon, no
notification-acknowledge HTTP handler exists in the API surface
(grep across internal/api/handler/ returned zero hits for
"acknowledge"). The audit appears to have mis-quoted; "acknowledge"
isn't a real backend endpoint to wrap. If a future feature adds
notification acknowledgement, register it in the same shape.

Verification:
- go build ./... — clean
- go vet ./internal/mcp/... — clean
- go test ./internal/mcp/... -count=1 — pass
- golangci-lint v2.11.4 run ./... — 0 issues
- MCP tool count went from 85 → 87 (verify via `grep -cE 'gomcp\.AddTool\(' internal/mcp/tools.go`)
- S-1 + G-3 + D-1 + D-2 + B-1 + L-1 CI guardrails all still pass

Audit findings closed:
- cat-i-b0924b6675f8 (P1, MCP discovery completeness — last P1 in audit)

This brings the audit to ZERO REMAINING P1s.

Deferred follow-ups:
- Notification acknowledge MCP tool — add when a notification-ack
  HTTP handler exists. Currently no such handler exists in the
  API surface; treat as a separate feature, not an MCP gap.
2026-04-25 16:33:56 +00:00
shankar0123 5e4eaa78b1 Merge branch 'fix/g3-master-env-var-docs-drift' (G-3 master, 3 audit findings) 2026-04-25 16:31:46 +00:00
shankar0123 2419f8cd27 docs(features): reconcile env-var inventory with config.go (G-3 master)
Closes three 2026-04-24 audit findings (all P2, all category cat-g):

  - cat-g-renewal_check_interval_rename_drift: features.md:152
    advertised CERTCTL_RENEWAL_CHECK_INTERVAL but config.go renamed
    that to CERTCTL_SCHEDULER_RENEWAL_CHECK_INTERVAL. Fixed in prose
    + the scheduler-loops table on line 1117.

  - cat-g-b8f8f8796159: 6 env vars in config.go that were never
    documented:
      CERTCTL_DATABASE_MIGRATIONS_PATH
      CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT
      CERTCTL_JOB_AWAITING_CSR_TIMEOUT
      CERTCTL_SCHEDULER_AGENT_HEALTH_CHECK_INTERVAL
      CERTCTL_SCHEDULER_JOB_PROCESSOR_INTERVAL
      CERTCTL_SCHEDULER_NOTIFICATION_PROCESS_INTERVAL
    Added to the scheduler-loops table at features.md:1117 and
    (DATABASE_MIGRATIONS_PATH) to the new Database Schema preamble.

  - cat-g-163dae19bc59: 37 env vars in docs not defined in config.go.
    The audit's strict comm over-flagged this set: most "phantoms"
    are integration-surface contracts (script env vars certctl
    EXPORTS to user-provided ACME DNS-01 / OpenSSL CA scripts;
    StepCA / Webhook per-issuer-or-notifier config-blob field
    names; CERTCTL_QA_* test fixtures; agent-side env vars defined
    in cmd/agent/main.go). The closure narrows the gate to the
    one true phantom (the rename) and allowlists the documented
    integration contracts in the CI guard. Each allowlist entry
    has a one-line justification.

CI regression guardrail:
- .github/workflows/ci.yml::"Forbidden env-var docs drift regression
  guard (G-3)" — runs `comm -23` both ways between the env vars
  defined in Go source (config.go + cmd/* + ACME DNS export +
  test fixtures) and env vars mentioned in README + docs/ +
  deploy/helm/. Fails the build if either set is non-empty modulo
  the documented integration-surface allowlist.

Verification:
- comm -23 docs vs defined → empty post-fix (allowlist applied)
- comm -23 defined vs docs → empty post-fix
- golangci-lint v2.11.4 run ./... → 0 issues
- tsc --noEmit → clean
- S-1 stale-counts guardrail still passes

Audit findings closed:
- cat-g-163dae19bc59 (P2, docs-only env vars)
- cat-g-b8f8f8796159 (P2, config-only env vars)
- cat-g-renewal_check_interval_rename_drift (P2, renamed env var still in docs)

Deferred follow-ups:
- The 26 documented-but-unimplemented integration contracts on the
  allowlist (CERTCTL_OPENSSL_*, CERTCTL_ACME_EAB_*, CERTCTL_WEBHOOK_*,
  CERTCTL_AUDIT_EXCLUDE_PATHS, CERTCTL_TLS_*, CERTCTL_ACME_DNS_PROPAGATION_WAIT)
  are documented in features.md / connectors.md / demo-advanced.md but
  not yet read by any Go source. Either implement in config.go (each is
  its own M-X) or delete from docs (separate cleanup PR). Neither
  expansion fits inside G-3's "reconcile drift" scope.
2026-04-25 16:31:45 +00:00
shankar0123 6f045293e9 Merge branch 'fix/s1-master-stale-counts' (S-1 master, 2 audit findings) 2026-04-25 16:26:54 +00:00
shankar0123 530da674f8 docs(README,features,examples): replace stale source counts with rebuild commands (S-1 master)
Closes two 2026-04-24 audit findings — one P1 (cat-s1-9ce1cbe26876,
README + features.md cite stale numeric counts) and one P2
(cat-s1-features_md_issuer_count_contradiction, features.md self-
disagreed on issuer count saying 9 in two places + 12 in two others).
Both root in a CLAUDE.md invariant: "Numeric claims about current
state rot the instant the next release lands... Before adding any
current-state count, delete it and write the command instead."

Per-site changes:
- docs/features.md::"At a Glance" table — replaced 12 hardcoded counts
  with `rebuild via <command>` references quoting the canonical
  source-of-truth grep from CLAUDE.md::"Current-state commands".
- docs/features.md::Issuer Connectors section — dropped "9 issuer
  connectors" (stale; live: 12) and "12 IssuerType constants" prose;
  prose now references the rebuild command.
- docs/features.md::Target Connectors section — same treatment for
  "14 target connector types".
- docs/features.md::"Per-type config schema validation for all 9
  issuer types" — same treatment.
- docs/features.md::"80 MCP tools covering all API endpoints" — same.
- docs/features.md::Web Dashboard section — dropped "24 pages wired"
  + the "(25 Route elements, 24 pages)" comment.
- docs/examples.md::"Beyond These Examples" — dropped "7 issuer
  backends and 10 target connectors" prose; references features.md
  and the rebuild commands.

CI regression guardrail:
- .github/workflows/ci.yml::"Forbidden hardcoded source-count prose
  regression guard (S-1)" — grep-fails the build if any of the
  blocked phrases (e.g. "9 issuer connectors", "21 database tables",
  "80 MCP tools") reappears in README or docs/. Allowlists demo-
  fixture prose ("32 certificates" — seed_demo.sql facts), historical
  WORKSPACE-CHANGELOG counts, the testing-guide example phrasing,
  and any number adjacent to a quoted rebuild command.

Verification:
- S-1 guardrail dry-run on post-fix tree → empty (good)
- golangci-lint v2.11.4 run ./... → 0 issues
- tsc --noEmit → clean
- vitest, vite build unchanged from pre-S-1 baseline (no JS/TS touched)

Audit findings closed:
- cat-s1-9ce1cbe26876 (P1, README + features.md stale numeric counts)
- cat-s1-features_md_issuer_count_contradiction (P2, features.md
  self-contradiction on issuer count)

Deferred follow-ups:
- WORKSPACE-CHANGELOG.md historical-milestone counts intentionally
  preserved (those are point-in-time facts about shipped slices, not
  current-state claims). README demo-fixture counts ("32 certs, 10
  issuers") preserved — those describe the seed_demo.sql shape, not
  the live source surface.
2026-04-25 16:26:44 +00:00
shankar0123 555eef449e Merge branch 'fix/d2-master-type-drift-cluster' (D-2 master, 5 audit findings) 2026-04-25 16:07:36 +00:00
shankar0123 55eb7135be fix(web,ci): close TS↔Go type drift across 5 entities (D-2 master)
Closes five 2026-04-24 audit findings (all P2, all category cat-f /
diff-05x06-*) by reconciling the TypeScript interfaces in
web/src/api/types.ts with the on-wire JSON shape Go's
internal/domain/*.go structs actually emit. D-1 closed the same pattern
for one entity (Certificate / ManagedCertificate); D-2 covers the
remaining five.

Per-entity verdicts (audit's "stricter side is the contract"):

  Agent       — TRIM 5 phantoms (last_heartbeat, capabilities, tags,
                created_at, updated_at). Go emits last_heartbeat_at only.
  Target      — ADD 2 (retired_at?, retired_reason?) — I-004 fields.
  DiscCert    — ADD pem_data? — real field, real Go emit, omitempty.
  Issuer      — TRIM phantom status. Go has Enabled bool only.
  Notif       — TRIM phantom subject. Go has Message string only.
  Certificate — verify-only; D-1 closure confirmed clean at recon.

Consumer fixes (same commit as the trim):
- AgentDetailPage.tsx — remove dead Capabilities + Tags sections (always
  rendered empty); replace agent.created_at/updated_at row with the
  Go-emitted registered_at; widen heartbeatStatus() to accept undefined.
- AgentsPage.tsx — same heartbeatStatus widening.
- IssuersPage.tsx + IssuerDetailPage.tsx — issuerStatus() now derives
  from `enabled` exclusively; the dead `issuer.status || 'Unknown'`
  fallback is gone.
- NotificationsPage.tsx — drop dead `|| n.subject` fallback.
- NotificationsPage.test.tsx — drop dead `subject:` from mocks.
- api/utils.ts::timeAgo widened to accept string | undefined | null.
- api/types.test.ts — Agent (I-004) fixture trimmed of the 5 phantoms.

Tests (Vitest):
- 5 new describe blocks in web/src/api/types.test.ts:
  - Agent interface (D-2 phantom-fields trim) — 2 it blocks
  - Target interface (D-2 retirement fields) — 2 it blocks
  - DiscoveredCertificate interface (D-2 pem_data ADD) — 2 it blocks
  - Issuer interface (D-2 status phantom trim) — 1 it block
  - Notification interface (D-2 subject phantom trim) — 1 it block
- Each block uses the literal-construction pattern from D-1; trimmed
  fields are pinned via excess-property comments that compile-fail when
  uncommented if a phantom is reintroduced.

CI regression guardrail:
- .github/workflows/ci.yml — existing D-1 step renamed to "Forbidden
  StatusBadge dead-key + TS phantom-field regression guard (D-1 + D-2)".
  Three new awk-windowed greps over Agent / Issuer / Notification
  interfaces in types.ts. The Agent grep includes a `grep -v
  'last_heartbeat_at'` filter to avoid false positives on the
  legitimate Go-emitted heartbeat field.

Documentation:
- CHANGELOG.md — new D-2 section above B-1 under [unreleased] with full
  Added/Removed/Audit findings closed/Known follow-ups breakdown.
- docs/architecture.md — Web Dashboard section gains a new "TS ↔ Go
  type contract rule (D-1 + D-2 closure)" paragraph capturing the
  stricter-side-wins rule and the CI guardrail it's anchored by.
- coverage-gap-audit-2026-04-24-v5/unified-audit.md — Live Tracker score
  20/47 → 25/47 (P2: 6/27 → 11/27). Per-finding  RESOLVED Status
  blocks added to all 5 diff-05x06-* entries plus the verify-only
  Certificate entry. Closed-bundle index gets D-2 row.

Verification (all gates green):
- cd web && tsc --noEmit                 → clean
- cd web && vitest run --reporter=dot    → 9 files, 302 tests passing
                                            (was 294 → +8 D-2 cases)
- cd web && vite build                   → clean
- go vet ./internal/... ./cmd/...        → clean (no Go touched)
- golangci-lint v2.11.4 run ./...        → 0 issues
- D-2 Agent guardrail dry-run            → empty (good)
- D-2 Issuer guardrail dry-run           → empty (good)
- D-2 Notification guardrail dry-run     → empty (good)
- D-2 Target ADD-shape sanity            → 2 retirement fields present
- D-2 DiscCert ADD-shape sanity          → pem_data present
- D-1 Certificate guardrail still clean  → empty (good)
- OpenAPI YAML parses                    → 89 paths

Audit findings closed:
- diff-05x06-7cdf4e78ae24 (P2, Agent TS↔Go drift)
- diff-05x06-2044a46f4dd0 (P2, Target TS↔DeploymentTarget Go drift)
- diff-05x06-85ab6b98a2f7 (P2, DiscoveredCertificate TS↔Go drift)
- diff-05x06-97fab8783a5c (P2, Issuer TS↔Go drift)
- diff-05x06-caba9eb3620e (P2, Notification TS↔NotificationEvent drift)
- diff-05x06-af18a8d7ef41 (P2) — verified clean since D-1; no edit

Deferred follow-ups:
- Issuer richer status view (enabled × test_status) — UX scope, not drift.
- Real Agent metadata (capabilities, tags) — backend feature, not drift.
- DiscoveredCertificate pem_data list-response perf — separate backend change.
2026-04-25 16:07:31 +00:00
shankar0123 2edac7e78b fix(mcp): close staticcheck ST1021 on BulkRenew/BulkReassign input docstrings
CI on the B-1 merge (b8a4318) failed at the golangci-lint step on two
ST1021 errors against internal/mcp/types.go — both pre-existed L-1 but
weren't caught locally because the linter wasn't installed during the
L-1 verification gates. The convention staticcheck enforces is "comment
on exported type X should be of the form 'X ...'" — i.e. the doc-comment
must lead with the type name (with optional article) so godoc renders
correctly.

  Before:  // L-1 master closure (cat-l-fa0c1ac07ab5): bulk-renew MCP tool input.
  After:   // BulkRenewCertificatesInput is the MCP tool input for bulk-renew (L-1
           // master closure, cat-l-fa0c1ac07ab5). Mirrors BulkRevokeCertificatesInput
           // field-for-field minus Reason.

Same shape applied to BulkReassignCertificatesInput. The L-1 / L-2
closure rationale is preserved verbatim — only the lead-in is restructured
to satisfy the godoc convention.

Verification:
- golangci-lint v2.11.4 (matching CI) installed locally at /dev/shm/bin
- golangci-lint run ./... --timeout 5m → 0 issues
- internal/mcp/... package targeted lint → 0 issues

This unblocks the B-1 CI run on master. No behavioral change; doc-only edit.
2026-04-25 15:48:39 +00:00
shankar0123 b8a4318082 Merge branch 'fix/b1-master-orphan-crud-edit-modals' (B-1 master, 4 audit findings) 2026-04-25 15:23:21 +00:00
shankar0123 097995e503 fix(web,ci): close orphan-CRUD GUI gaps + dead exportCertificatePEM (B-1 master)
Closes four 2026-04-24 audit findings via per-page Edit modals on five
existing pages, a brand-new RenewalPoliciesPage for the rp-* CRUD surface,
and removal of one dead duplicate so the public client surface stops
growing without consumers. Anchored by a CI grep guardrail that fails
the build if any of the eight previously-orphan client functions loses
its non-test page consumer or if exportCertificatePEM is resurrected.

Per-page Edit modals (mirroring existing CreateXModal scaffolding):
- web/src/pages/OwnersPage.tsx — EditOwnerModal (name/email/team_id)
- web/src/pages/TeamsPage.tsx — EditTeamModal (name/description)
- web/src/pages/AgentGroupsPage.tsx — EditAgentGroupModal (full match-rule
  set: name/description/match_os/match_architecture/match_ip_cidr/
  match_version/enabled)
- web/src/pages/IssuersPage.tsx — EditIssuerModal (rename-only; type
  locked, config blob preserved untouched, footer note about delete+
  recreate for credential rotation)
- web/src/pages/ProfilesPage.tsx — EditProfileModal (rename + description
  only; policy fields preserved untouched, footer note about deferred
  policy editing)

New page (closes cat-b-4631ca092bee — RenewalPolicy CRUD orphan):
- web/src/pages/RenewalPoliciesPage.tsx — full CRUD page with shared
  PolicyFormModal for Create + Edit (form shape identical), 7-column
  DataTable (Policy/RenewalWindow/Auto/Retries/AlertThresholds/Created/
  Actions), comma-separated alert_thresholds_days input parser, and
  alert() surfacing of repository.ErrRenewalPolicyInUse (409) on Delete
  so operators can re-target dependent certs before deletion.
- web/src/main.tsx — adds /renewal-policies route.
- web/src/components/Layout.tsx — adds sidebar nav item slotted between
  Policies and Profiles.

Removed (closes cat-b-9b97ffb35ef7 — dead duplicate):
- web/src/api/client.ts::exportCertificatePEM — zero consumers across
  web/, MCP, CLI, tests; downloadCertificatePEM is the actual call site
  in CertificateDetailPage. Test references in client.test.ts and
  client.error.test.ts also removed.

CI regression guardrail:
- .github/workflows/ci.yml — adds 'Forbidden orphan-CRUD client function
  regression guard (B-1)' step. Greps for all eight previously-orphan
  fns (updateOwner/updateTeam/updateAgentGroup/updateIssuer/updateProfile
  + createRenewalPolicy/updateRenewalPolicy/deleteRenewalPolicy) under
  web/src/pages/ and fails the build if any has zero non-test consumers.
  Also blocks resurrection of exportCertificatePEM. Verified locally
  (all 8 fns have ≥2 consumers; exportCertificatePEM is gone) and
  against synthetic regressions.

Documentation:
- CHANGELOG.md — new B-1 section above L-1 under [unreleased].
- docs/architecture.md — Web Dashboard section gains a new paragraph
  capturing the 'every backend CRUD must have a GUI consumer' rule
  with reference to the CI guardrail.
- coverage-gap-audit-2026-04-24-v5/unified-audit.md — flips four
  findings to  RESOLVED with detailed Status blocks; bumps Live
  Tracker score 16/47 → 20/47 (P1: 9→12, P3: 1→2); adds B-1 row to
  closed-bundle index.

Verification:
- cd web && tsc --noEmit — clean
- cd web && vitest run — 9 test files, 294 tests, all passing
- cd web && vite build — clean (no new warnings)
- B-1 guardrail dry-run — all 8 client fns have ≥2 page consumers,
  exportCertificatePEM removed (good), FAIL=0

Audit findings closed:
- cat-b-31ceb6aaa9f1 (P1, updateOwner/updateTeam/updateAgentGroup orphan)
- cat-b-7a34f893a8f9 (P1, updateIssuer/updateProfile orphan, rename-only)
- cat-b-4631ca092bee (P1, RenewalPolicy CRUD orphan)
- cat-b-9b97ffb35ef7 (P3, exportCertificatePEM dead duplicate)

Deferred follow-ups:
- Fuller EditIssuerModal with credential-rotation flow (needs threat
  model: rotation reuse window, in-flight CSR cancellation, audit-trail
  granularity).
- Fuller EditProfileModal with policy-field editing (max-TTL, allowed
  EKUs, allowed key algorithms — affect already-issued cert evaluation).
- Per-page Vitest coverage for the new Edit modals (CI grep guardrail
  catches the same regression vector at lower cost).
2026-04-25 15:23:15 +00:00
shankar0123 3fc1a2222f Merge branch 'fix/l1-master-bulk-action-endpoints' (L-1 master, 2 audit findings) 2026-04-25 14:33:10 +00:00
shankar0123 f0865bb051 fix(api,web,mcp): add bulk-renew + bulk-reassign endpoints, drop client-side N×HTTP loops (L-1 master)
Two audit findings, both category cat-l, both rooted in
web/src/pages/CertificatesPage.tsx. Pre-L-1 the GUI looped per-cert
HTTP calls — 100 selected certs = 100 sequential round-trips × ~50–200
ms each = a 5–20-second wedge during which the operator stared at a
progress bar. Post-L-1 each workflow is a single POST.

  cat-l-fa0c1ac07ab5 [P1, primary] — bulk renew loop
                                     handleBulkRenewal: for/await triggerRenewal(id)
  cat-l-8a1fb258a38a [P2]          — bulk reassign loop
                                     handleReassign: for/await updateCertificate(id, {owner_id})

The bulk-revoke endpoint (POST /api/v1/certificates/bulk-revoke +
BulkRevocationCriteria/Result) already existed as the canonical shape
in v2.0.x — L-1 ports that pattern to renew + reassign with per-action
twists.

Backend (Go)
- internal/domain/bulk_renewal.go: BulkRenewalCriteria mirrors
  BulkRevocationCriteria (criteria + IDs modes); BulkRenewalResult
  envelope adds EnqueuedJobs[] for per-cert {certificate_id, job_id};
  shared BulkOperationError type for all bulk paths.
- internal/domain/bulk_reassignment.go: narrower shape — IDs-only,
  owner_id required, team_id optional.
- internal/service/bulk_renewal.go::BulkRenewalService.BulkRenew:
  resolves criteria → status filter (Archived/Revoked/Expired/
  RenewalInProgress all silent-skip) → per-cert status flip + job
  create. Keygen-mode-aware so jobs land in the same initial status
  as single-cert TriggerRenewal. Single bulk audit event per call,
  not N.
- internal/service/bulk_reassignment.go::BulkReassignmentService.
  BulkReassign: validates owner_id upfront via the
  ErrBulkReassignOwnerNotFound typed sentinel — non-existent owner
  returns 400 before any cert is touched. Already-owned-by-target
  is silent-skip. Single bulk audit event.
- internal/api/handler/{bulk_renewal,bulk_reassignment}.go: HTTP
  shape mirrors bulk_revocation.go. NOT admin-gated (renew is non-
  destructive; reassign is a common-case workflow). Sentinel-error
  → 400 mapping for OwnerNotFound.
- internal/api/router/router.go: three bulk-* routes registered as a
  block before the {id} routes. HandlerRegistry gains BulkRenewal +
  BulkReassignment fields.
- cmd/server/main.go: NewBulkRenewalService threads cfg.Keygen.Mode
  so bulk-renew jobs land in same initial state as single-cert path.

Frontend
- web/src/api/client.ts: bulkRenewCertificates(criteria) +
  bulkReassignCertificates(request) functions with full TS types.
- web/src/pages/CertificatesPage.tsx: handleBulkRenewal + handleReassign
  rewritten from N-call loops to single calls. Result envelope drives
  progress UI; first-error message surfaced when total_failed > 0.
  Stale triggerRenewal + updateCertificate imports removed.

MCP
- internal/mcp/types.go: BulkRenewCertificatesInput +
  BulkReassignCertificatesInput.
- internal/mcp/tools.go: certctl_bulk_renew_certificates +
  certctl_bulk_reassign_certificates tools mirroring the existing
  certctl_bulk_revoke_certificates pattern.

OpenAPI
- api/openapi.yaml: two new operations (bulkRenewCertificates,
  bulkReassignCertificates) under Certificates tag. Four new schemas
  (BulkRenewRequest, BulkRenewResult, BulkEnqueuedJob,
  BulkReassignRequest, BulkReassignResult).

Tests
- Domain: BulkRenewalCriteria.IsEmpty + BulkReassignmentRequest.IsEmpty
  IsEmpty contracts; JSON round-trip shape pinning.
- Service: 7 BulkRenew tests (happy/criteria-mode/skips-RenewalInProgress/
  skips-revoked-archived/empty-criteria-error/partial-failure/
  audit-event-emitted) + 8 BulkReassign tests (happy/skips-already-
  owned/owner-required/empty-IDs/owner-not-found-sentinel/team-id-
  optional/team-id-provided/partial-failure/audit-event-emitted).
- Handler: 5 BulkRenew handler tests (happy/empty-body-400/wrong-
  method-405/actor-attribution/service-error-500) + 6 BulkReassign
  handler tests (happy/empty-IDs-400/missing-owner-400/owner-not-
  found-400-via-sentinel/wrong-method-405/generic-error-500).

CI guardrail
- .github/workflows/ci.yml: 'Forbidden client-side bulk-action loop
  regression guard (L-1)'. Greps web/src/pages/CertificatesPage.tsx
  for 'for(...) await triggerRenewal(...)' and 'for(...) await
  updateCertificate(...)' patterns; comment lines exempt; test files
  exempt. Verified locally (passes against post-fix tree, fires
  against synthetic regression).

Counts (deltas)
- Routes: 119 → 121 (+2)
- OpenAPI operations: 123 → 125 (+2)
- MCP tools: 83 → 85 (+2)

Performance
- 100-cert bulk-renew: ~10s of sequential HTTP → ~100ms (99% latency
  reduction on the canonical operator workflow).
- Audit event volume: 1 + N per operation → 1.

Out of scope (deferred follow-ups)
- cat-b-31ceb6aaa9f1: updateOwner/updateTeam/updateAgentGroup orphan
  (different shape — wire existing PUT to GUI, not new bulk endpoint).
- cat-k-e85d1099b2d7: CertificatesPage no pagination UI.
- cat-i-b0924b6675f8: MCP missing claim/dismiss/acknowledge (L-1 added
  2 new tools but does not close that finding).

Verification
- go build / vet / test -short / test -short -race all clean.
- web tsc --noEmit + vitest run all clean (296 tests passing).
- OpenAPI YAML parses (89 paths, 125 ops).
- L-1 CI guardrail passes against post-fix tree, fires against
  synthetic regression.

No push.
2026-04-25 14:33:02 +00:00
shankar0123 677524d9ec Merge branch 'fix/d1-master-statusbadge-enum-drift' (D-1 master, 5 audit findings) v2.0.51 v2.0.52 2026-04-25 13:53:02 +00:00
shankar0123 9dc0742e77 fix(web): close StatusBadge enum drift + Certificate TS phantom fields (D-1 master)
Five audit findings, all category cat-d or cat-f, all rooted in two
frontend files. The dashboard silently lied:

  cat-d-359e92c20cbf [P1, primary] — Agent: 'Stale' dead key + 'Degraded'
                                     neutral fallthrough
  cat-d-9f4c8e4a91f1 [P2]          — Notification: 'dead' missing
  cat-d-1447e04732e7 [P3]          — Cert: 'PendingIssuance' dead key
  cat-f-cert_detail_page_key_render_fallback [P2] — render-site reads
                                                    cert.key_algorithm directly
  cat-f-ae0d06b6588f [P2]          — Certificate TS phantom fields (root cause)

Pre-D-1, agents in the only Go AgentStatus that means 'needs operator
attention' (Degraded) rendered as default neutral grey because StatusBadge
mapped 'Stale' (a key Go has never emitted) to yellow. Dead-letter
notifications visually equated with 'read' (operator-acknowledged). The
Certificate badge map carried a 'PendingIssuance' key no Go enum emits.
CertificateDetailPage's Key Algorithm and Key Size rows always rendered
'—' even when the data was a single fetch away — the lookup went through
cert.key_algorithm / cert.key_size directly, both phantom Certificate TS
fields. Trim the TS type so the missing-data case is explicit; fix the
render site to use latestVersion?.field; pin the contract with a 38-case
Vitest property test that walks every Go enum.

StatusBadge (web/src/components/StatusBadge.tsx)
- Drop 'Stale' (Agent dead key) + 'PendingIssuance' (Cert dead key).
- Add 'Degraded' (Agent → badge-warning) + 'dead' (Notification → badge-danger).
- Add leading docblock naming Go-side source-of-truth file for every
  status family and pointing at the property test as regression vector.

Property test (web/src/components/StatusBadge.test.tsx — 38 cases)
- Iterates every Go-emitted enum value (AgentStatus, CertificateStatus,
  JobStatus, NotificationStatus, DiscoveryStatus, HealthStatus) plus the
  two frontend-synthesized Enabled/Disabled labels, asserts every value
  gets a non-default class (or an explicit 'badge badge-neutral' for the
  five intentionally-neutral terminal values: Archived, Cancelled,
  Dismissed, read, unknown).
- Negative assertions: 'Stale' and 'PendingIssuance' must fall through
  to the dictionary default — re-adding either key surfaces here.
- Specific UX-correctness assertions: 'dead' → badge-danger,
  'Degraded' → badge-warning.
- Unknown-status fallthrough preserves label text.

Certificate TS trim (web/src/api/types.ts)
- Drop serial_number?, fingerprint_sha256?, key_algorithm?, key_size?,
  issued_at? from Certificate. Go's ManagedCertificate has never carried
  these — they live on CertificateVersion. Post-trim a cert.X access for
  any of the five fields is a TS compile error.
- Leading docblock cross-references the closure rationale and the
  latestVersion fallback pattern.

Render-site fix (web/src/pages/CertificateDetailPage.tsx)
- Key Algorithm / Key Size rows now read latestVersion?.key_algorithm /
  latestVersion?.key_size, mirroring the existing latestVersion fallback
  used a few lines above for serial_number / fingerprint_sha256.
- The same edit also tightened the serial / fingerprint / issued_at
  derivations to drop the now-impossible 'cert.X || latestVersion?.X'
  cert-side leg (cert.serial_number is a TS error post-trim).

Type-test regression (web/src/api/types.test.ts)
- Certificate literal construction pinned post-trim — adding any of the
  five fields back makes the literal an excess-property TS error.
- Sibling CertificateVersion literal pinning the trimmed fields still
  live on the version envelope (so the CertificateDetailPage fallback
  path can't break).

OpenAPI (api/openapi.yaml)
- ManagedCertificate schema unchanged — was already correct (no phantom
  fields). Added a leading comment cross-referencing the D-5 closure for
  future readers.

CI guardrail (.github/workflows/ci.yml)
- 'Forbidden StatusBadge dead-key + Certificate phantom-field regression
  guard (D-1)'. Two grep blocks: catches Stale/PendingIssuance map
  literals in StatusBadge.tsx; uses an awk-scoped window over the
  'export interface Certificate {' block in types.ts to catch the five
  phantom fields reappearing while explicitly excluding CertificateVersion
  (which legitimately carries them). Comments + test files exempt.

Verification
- Backend build/vet/test -short -race all clean across handler/router/
  middleware packages.
- Frontend tsc --noEmit clean.
- Vitest 256 → 296 tests (+40: 38 from new StatusBadge test, 2 from D-5
  Certificate trim regression in types.test.ts).
- OpenAPI YAML parses (87 paths).
- Both CI guardrail patterns clear on the post-fix tree; both fire
  against synthetic regression patterns (re-add Stale → fires; re-add
  serial_number? to Certificate → fires).

Out of scope (deferred)
- diff-05x06-* type drifts for Agent/DeploymentTarget/Notification/
  DiscoveredCertificate/Issuer TS interfaces. Per-type field-by-field
  Go ↔ TS diff is codegen-shaped, not edit-shaped — warrants its own
  D-2 master prompt. Noted in CHANGELOG follow-ups section.
2026-04-25 13:52:54 +00:00
shankar0123 1440a30d28 Merge branch 'fix/u3-master-db-coupling-cleanup' (U-3 master + 4 ride-alongs) 2026-04-25 13:29:30 +00:00
shankar0123 a3d8b9c607 fix(deploy,db,handler): close fresh-clone postgres init failure + 4 ride-along audit findings (U-3 master)
GitHub #10 reopened: operator mikeakasully cloned v2.0.50 fresh and ran the
canonical quickstart (docker compose -f deploy/docker-compose.yml up -d --build);
postgres reported unhealthy indefinitely, dependent containers never started.

Root cause: deploy/docker-compose.yml mounted a hand-curated subset of
migrations/*.up.sql + seed.sql into postgres /docker-entrypoint-initdb.d/.
Postgres applied them at initdb time. Once seed.sql referenced columns added
by migrations *after* the mounted cutoff (e.g., policy_rules.severity from
migration 000013), initdb crashed mid-seed and the container loop wedged.
Two sources of truth (compose mount list vs in-tree migration ladder)
diverged the moment a seed-touching migration shipped, and the only thing
that fixed it was hand-editing the compose file every release.

Fix: remove the dual source. Postgres boots empty; the server applies
migrations + seed at startup via RunMigrations + RunSeed. Helm has used
this pattern since day one (postgres-init emptyDir); compose now matches.

Bundled with four ride-along audit findings whose fixes share the same
schema/db code surface, so operators take the schema-change pain only once:

  cat-u-seed_initdb_schema_drift           [P1, primary] — initdb-mount fix
  cat-o-retry_interval_unit_mismatch       [P1] — column rename minutes→seconds
  cat-o-notification_created_at_dead_field [P2] — add column + populate
  cat-o-health_check_column_orphans        [P1] — drop unwired columns
  cat-u-no_version_endpoint                [P2] — add /api/v1/version

Single migration (000017_db_coupling_cleanup) bundles the three schema
changes under a DO \$\$ guard so re-application is safe; reduces
operator-visible 'schema-change releases' from four to one.

Backend
- internal/repository/postgres/db.go: add RunSeed (baseline) + RunDemoSeed
  (gated by CERTCTL_DEMO_SEED). Both idempotent (ON CONFLICT DO NOTHING in
  every shipped INSERT) so repeated boots are safe; missing-file is no-op
  so custom packaging that strips seeds still boots cleanly.
- cmd/server/main.go: invoke RunSeed (always) + RunDemoSeed (when flag set)
  immediately after RunMigrations.
- internal/repository/postgres/notification.go: NotificationRepository.Create
  now sets created_at (with time.Now() fallback when caller leaves it zero);
  scanNotification reads it back; List + ListRetryEligible SELECT extended.
- internal/repository/postgres/renewal_policy.go: column references updated
  to retry_interval_seconds across SELECT/INSERT/UPDATE sites.
- internal/api/handler/version.go: new VersionHandler exposes
  {version, commit, modified, build_time, go_version} from
  runtime/debug.ReadBuildInfo() with ldflags-supplied Version override.
- internal/api/router/router.go: register GET /api/v1/version through the
  no-auth chain (CORS + ContentType) alongside /health, /ready,
  /api/v1/auth/info.
- cmd/server/main.go: add /api/v1/version to no-auth dispatch + audit
  ExcludePaths so rollout polling doesn't dominate the audit trail.
- internal/config/config.go: add DatabaseConfig.DemoSeed +
  CERTCTL_DEMO_SEED env var.

Migration
- migrations/000017_db_coupling_cleanup.up.sql + .down.sql:
    (1) renewal_policies.retry_interval_minutes → retry_interval_seconds
        (DO \$\$ guard, idempotent re-application)
    (2) notification_events ADD COLUMN created_at TIMESTAMPTZ
        NOT NULL DEFAULT NOW()
    (3) network_scan_targets DROP orphan health_check_enabled +
        health_check_interval_seconds
- migrations/seed.sql: column reference updated to retry_interval_seconds.
- migrations/seed_demo.sql: same column rename + applied at runtime now via
  RunDemoSeed (no longer initdb-mounted).

Compose
- deploy/docker-compose.yml: drop ALL initdb mounts (10 migration files +
  seed.sql); add start_period: 30s to postgres + certctl-server healthchecks
  to absorb the runtime migration + seed application window on first boot.
- deploy/docker-compose.test.yml: same drop (+ ghost seed_test.sql mount
  removed; that file never existed); same healthcheck start_period.
- deploy/docker-compose.demo.yml: replace seed_demo.sql initdb mount with
  CERTCTL_DEMO_SEED=true env var on certctl-server.

Tests
- internal/api/handler/version_handler_test.go: TestVersion_ReturnsBuildInfo,
  TestVersion_RejectsNonGet, TestVersion_LdflagsOverride.
- internal/repository/postgres/seed_test.go: TestRunSeed_AppliesIdempotently,
  TestRunSeed_MissingFileIsNoOp, TestRunDemoSeed_AppliesIdempotently,
  TestMigration000017_RetryIntervalRename,
  TestMigration000017_NotificationCreatedAt,
  TestMigration000017_HealthCheckOrphansDropped (testcontainers, -short skips).
- internal/repository/postgres/notification_test.go:
  TestNotificationRepository_CreatedAt_IsPersisted +
  TestNotificationRepository_CreatedAt_DefaultsToNow.

CI guardrail
- .github/workflows/ci.yml: new 'Forbidden migration mount in compose initdb
  (U-3)' step grep-fails the build if any migrations/*.sql or seed*.sql
  re-appears in /docker-entrypoint-initdb.d in any compose file. Catches
  future drift before a fresh-clone operator hits it.

Spec / Docs
- api/openapi.yaml: add /api/v1/version operation under Health tag.
- docs/architecture.md: replace the 'initdb may run the same SQL' paragraph
  with a post-U-3 single-source-of-truth explanation.
- CHANGELOG.md: full unreleased-section entry covering all 5 closures,
  breaking changes, and the new env var.

Audit doc
- coverage-gap-audit-2026-04-24-v5/unified-audit.md: add new P1 #14
  cat-u-seed_initdb_schema_drift; flip the 4 ride-along findings to
   RESOLVED with closure prose pointing at this commit.

Verification: build/vet/test -short -race all clean across all touched
packages locally; govulncheck reports 0 vulnerabilities affecting our
code; OpenAPI YAML parses; CI U-3 grep guardrail clears against the
post-fix tree.
2026-04-25 13:29:23 +00:00
shankar0123 aa6fafdee9 Merge branch 'fix/u2-dockerfile-healthcheck-https' v2.0.50 2026-04-25 12:02:28 +00:00
shankar0123 86fffa305a fix(deploy,helm,docs): published-image HEALTHCHECK speaks HTTPS + Helm /ready path + docs HTTPS sweep (U-2)
Pre-U-2 the published `ghcr.io/shankar0123/certctl-server` image
shipped with `HEALTHCHECK CMD curl -f http://localhost:8443/health`.
The server has been HTTPS-only since the v2.2 HTTPS-Everywhere milestone
(`cmd/server/main.go::ListenAndServeTLS`, no plaintext fallback, TLS
1.3 pinned), so the probe failed on every interval and Docker marked
the container `unhealthy` indefinitely. Operators inside docker-
compose / Helm / the example stacks were unaffected — compose overrides
the HEALTHCHECK with `--cacert + https://`, Helm uses explicit
`httpGet` probes that ignore Docker's HEALTHCHECK, and every example
compose file overrides with `curl -sfk https://localhost:8443/health`.
But anyone running bare `docker run` / Docker Swarm / Nomad / ECS —
exactly the "I just pulled the published image" path — saw permanent
`unhealthy` status and (depending on orchestrator policy) a restart-
loop. (Audit: cat-u-healthcheck_protocol_mismatch in
coverage-gap-audit-2026-04-24-v5/unified-audit.md.)

Recon for U-2 surfaced two adjacent bugs from the same v2.2 milestone
gap, both bundled into this commit because they share the same root
cause and the same operator surface:

  1. Helm chart `server.readinessProbe.httpGet.path` pointed at
     `/readyz`, the kube-flavored convention. The certctl server
     doesn't register `/readyz` (only `/health` and `/ready` are
     wired and bypass the auth middleware — see
     internal/api/router/router.go:81 and cmd/server/main.go:920).
     K8s readiness probes therefore got 401 (api-key auth rejection)
     or 404 (when auth was disabled), pods stayed `NotReady`
     indefinitely, and Helm rollouts stalled.

  2. The agent image (`Dockerfile.agent`) had no HEALTHCHECK at all,
     so bare-`docker run` agents got zero health signal. The
     compose override at `deploy/docker-compose.yml:173` called
     `pgrep -f certctl-agent` against the agent image, but the
     agent image didn't ship `procps` — pgrep was missing too. The
     compose probe was a latent always-fail.

We fixed all three with the audit-recommended shape (option (a) — `-k`)
plus three structural backstops:

Files changed:

Phase 1 — Dockerfile fix:
- Dockerfile: HEALTHCHECK switched from `curl -f http://localhost:8443/
  health` to `curl -fsk https://localhost:8443/health`. `-k`
  (insecure) is acceptable because the probe is localhost-to-localhost:
  the same process serving the cert is being probed, no network hop.
  Pinning `--cacert` is not viable for the published image because
  the bootstrap cert is per-deploy (generated into the `certs` named
  volume on first up; operator-supplied via Helm's `existingSecret`
  or cert-manager). Long-form docblock cross-references the audit
  closure, the compose vs Helm vs examples coverage matrix, and the
  CI guardrail.
- Dockerfile.agent: added HEALTHCHECK using `pgrep -f certctl-agent`
  matching the compose pattern. Added `procps` to the runtime apk
  install — fixes both the new image-level HEALTHCHECK AND the
  pre-existing compose probe that was silently failing.

Phase 2 — Helm readiness probe path:
- deploy/helm/certctl/values.yaml: server.readinessProbe.httpGet.path
  changed from `/readyz` to `/ready`. Liveness probe path
  (`/health`) was correct and is unchanged. Probes block now carries
  an explanatory comment naming the registered no-auth probe routes
  and the U-2 closure rationale.

Phase 3 — Image-level integration tests:
- deploy/test/healthcheck_test.go (new, //go:build integration):
  TestPublishedServerImage_HealthcheckSpecUsesHTTPS builds the server
  image, inspects `Config.Healthcheck.Test` via `docker inspect`,
  and asserts the array contains `https://localhost:8443/health` and
  `-k`, and does NOT contain `http://localhost:8443/health`
  (positive + negative regression contracts).
  TestPublishedAgentImage_HealthcheckSpecExists builds the agent image
  and asserts the HEALTHCHECK uses `pgrep` against `certctl-agent`.
  Both tests `t.Skip` cleanly when docker isn't available (sandbox /
  CI without docker-in-docker) — verified locally: tests skip with the
  diagnostic and the suite returns PASS.
  TestPublishedServerImage_HealthcheckTransitionsToHealthy is a
  documented `t.Skip` placeholder until the harness wires a sidecar
  postgres for image-level smoke; the spec-level tests above cover the
  audit-flagged regression.

Phase 4 — CI guardrail:
- .github/workflows/ci.yml: new "Forbidden plaintext HEALTHCHECK
  regression guard (U-2)" step. Scoped patterns catch
  `HEALTHCHECK.*http://` and `curl -f http://localhost:8443/health`
  in any `Dockerfile*`. Comment lines exempt; docs/upgrade-to-tls.md
  out of scope (the post-cutover invariant string at line 182 is
  intentionally a documented expected-failure assertion). Verified
  locally on the real tree (passes) and against synthetic regressions
  (each fires the guard).

Phase 5 — Docs sweep:
- docs/connectors.md: 15 stale curl examples updated from
  `http://localhost:8443/...` to `https://localhost:8443/...` with
  `--cacert "$CA"` injected on every site. Added a one-time
  introductory note documenting the `$CA` extraction with
  `docker compose ... exec ... cat /etc/certctl/tls/ca.crt`,
  matching the pattern in docs/quickstart.md. Pre-U-2 these examples
  silently failed against the HTTPS listener.

Phase 6 — Release surface:
- CHANGELOG.md: appended U-2 section to the existing [unreleased]
  block (immediately below the G-1 entry). Sections: explanatory
  blockquote covering all three bugs (primary + 2 adjacent), Fixed,
  Added, Changed.

Verification (all gates pass):
- go build ./... — clean
- go vet ./... — clean
- go vet -tags integration ./deploy/test/ — clean
- go test -short ./... — every package green
- go test -tags integration -v -run TestPublishedServerImage|TestPublishedAgentImage ./deploy/test/ —
  three tests SKIP cleanly with "docker not available" diagnostic
- helm lint deploy/helm/certctl/ — clean
- helm template smoke render — succeeds; rendered Deployment carries
  `path: /ready` and zero `/readyz` matches
- python3 yaml.safe_load on api/openapi.yaml — parses
- govulncheck ./... — no vulnerabilities in our code
- CI guardrail mirror: clean on real tree, fires on synthetic
  regression patterns

Out of scope (intentionally untouched):
- cmd/server/main.go::ListenAndServeTLS — HTTPS-only is correct,
  this finding does NOT propose adding back a plaintext listener.
- deploy/docker-compose.yml:126 HEALTHCHECK — already correct.
- deploy/docker-compose.test.yml HEALTHCHECK blocks — already correct.
- All 5 examples/*/docker-compose.yml HEALTHCHECK overrides — already
  correct (they ALSO use `-fsk https://localhost:8443/health`).
- Helm server.livenessProbe.httpGet — already uses `scheme: HTTPS` +
  `path: /health`, correct.
- docs/upgrade-to-tls.md:182 `curl ... http://localhost:8443/health`
  invariant line — that's the expected-failure assertion for the
  post-cutover state ("plaintext is gone, expect Connection refused");
  intentionally left intact.
- Go production code — this is purely a deploy-image / probe / docs /
  Helm-chart fix.

Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
      §2 P1 cluster, cat-u-healthcheck_protocol_mismatch
      Audit recommendation followed verbatim: 'change Dockerfile:80
      to CMD curl -kf https://localhost:8443/health'.
2026-04-25 12:02:18 +00:00
shankar0123 e17788355b Merge branch 'fix/g2-apikey-hash-redaction' 2026-04-25 01:56:34 +00:00
shankar0123 87213128cc fix(security,domain): redact Agent.APIKeyHash from JSON wire shape (G-2)
Pre-G-2 internal/domain/connector.go::Agent::APIKeyHash was tagged
`json:"api_key_hash"` and shipped on every wire surface that returned
domain.Agent — GET /api/v1/agents (PagedResponse{Data: agents}),
GET /api/v1/agents/{id}, GET /api/v1/agents/retired, and the
POST /api/v1/agents registration response. Every authenticated client
(browser, CLI --json, MCP tool calls) received the SHA-256-of-the-API-key
string. The browser silently dropped it because web/src/api/types.ts
omits the field, but CLI and MCP consumers print full JSON so the hash
was visible there. Even though the value is a hash and not the plaintext
key, shipping it gives an attacker an offline brute-force target if the
API-key entropy is low (certctl doesn't enforce a minimum on operator-
supplied keys), and there's no business reason for any client to ever
receive it — the value is server-internal, used only for the lookup at
internal/repository/postgres/agent.go::GetByAPIKey. (Audit:
cat-s5-apikey_leak in coverage-gap-audit-2026-04-24-v5/unified-audit.md.)

We chose the audit's recommended fix (json:"-") plus a defense-in-depth
MarshalJSON plus a CI guardrail. Three layers because struct-tag
redaction alone is one rebase away from being silently reverted, the
custom MarshalJSON catches the case where a parent struct embeds Agent
under a different tag, and the CI grep blocks reintroduction at the spec
or frontend boundary even without a code review catching it.

Files changed:

Phase 1 — Domain redaction:
- internal/domain/connector.go: APIKeyHash tag flipped from
  `json:"api_key_hash"` to `json:"-"`. New Agent.MarshalJSON
  with value receiver + type-alias-recursion-break that explicitly
  zeroes APIKeyHash on the marshal-time copy. Long-form docblock
  explaining the G-2 closure rationale + cross-references to
  service.RegisterAgent (populator), repository.AgentRepository::
  GetByAPIKey (consumer), docs/architecture.md (DB-shape vs
  API-shape distinction), and the audit finding.

Phase 2 — Domain tests (5 test functions):
- internal/domain/connector_test.go: TestAgent_MarshalJSON_RedactsAPIKeyHash
  pins the marshal-boundary contract on a value receiver. ...RedactsViaPointer
  pins the *Agent path. ...RedactsInSlice pins the []Agent path that the
  ListAgents handler actually emits via PagedResponse. ...DoesNotMutateReceiver
  pins the by-value-receiver contract so a future refactor that switches
  to pointer-receiver gets caught. ...RoundTrip pins the wire-shape
  guarantee that APIKeyHash is dropped on encode and cannot reappear on
  decode. Single sentinel value ("sha256:LEAKED-CREDENTIAL-DERIVATIVE-
  SENTINEL") flows through every fixture for grep-ability on regression.

Phase 3 — Handler tests (4 test functions):
- internal/api/handler/agent_handler_test.go: TestListAgents_DoesNotLeakAPIKeyHash,
  TestGetAgent_DoesNotLeakAPIKeyHash, TestRegisterAgent_DoesNotLeakAPIKeyHash,
  TestListRetiredAgents_DoesNotLeakAPIKeyHash. Each asserts (a) the
  literal substring "api_key_hash" is absent from the httptest-captured
  body, (b) the leak sentinel value is absent, (c) the non-leaked fields
  ARE present (sanity that the handler is serving real data, not just
  empty payloads). Shared sentinel "sha256:LEAKED-CREDENTIAL-DERIVATIVE-
  HANDLER-SENTINEL" so a single grep over a failing test's output
  identifies the leak surface immediately.

Phase 4 — Spec / docs:
- api/openapi.yaml: api_key_hash property REMOVED from Agent schema
  (was at line 3690). Inline G-2 comment naming the closure + the
  database-vs-API-shape distinction so a future spec edit doesn't
  silently re-introduce the field.
- docs/architecture.md: ER-diagram block already documents the agents
  table including api_key_hash (DB shape — correct). Added a sibling
  note paragraph immediately below the diagram explaining that several
  columns are intentionally server-internal (api_key_hash redaction
  + issuers.config / deployment_targets.config encrypted shadow), with
  cross-references to the redaction enforcement site, the OpenAPI
  schema, the frontend interface, and the CI guardrail.
- web/src/api/types.ts: Agent interface unchanged in shape (already
  omitted the field) but added a leading comment block explaining
  WHY the omission is intentional — stops a future frontend dev from
  "completing" the interface from the OpenAPI spec or the Go struct.

Phase 5 — CI guardrail:
- .github/workflows/ci.yml: new "Forbidden api_key_hash JSON-shape
  regression guard (G-2)" step. Scoped patterns catch the actual
  regression shapes — Go struct tag (json:"api_key_hash"), frontend
  interface declaration, OpenAPI schema property, YAML enum/array
  membership. Repository / migration / seed / service / integration /
  unit-test / comment lines exempt. Verified locally on the real tree
  (passes) and against 4 synthetic regression patterns (each fires
  the guardrail). Mirrors the G-1 pattern from .github/workflows/
  ci.yml lines 47-108.

Phase 5b — Sweep verification (no changes, results documented for the
next reader):
- internal/api/middleware/audit.go: doesn't serialize Agent struct;
  records request body only. No leak.
- service.RegisterAgent audit-event payload: `map[string]interface{}{
  "name": name, "hostname": hostname}` — name + hostname only,
  no APIKeyHash. No leak.
- All 9 slog sites that mention agent: scalar attrs only ("agent_id",
  "error", "agent_hostname"), never the full struct. No leak.
- internal/mcp, internal/cli, cmd/cli, cmd/mcp-server: zero matches
  for APIKeyHash / api_key_hash. Both pass server JSON verbatim, so
  the wire-side fix transitively closes them.

Verification (all gates pass):
- go build ./...
- go vet ./...
- go test -short ./... — every package green
- go test -short -race ./internal/domain/... ./internal/api/handler/... — clean
- govulncheck ./... — no vulnerabilities in our code
- helm lint deploy/helm/certctl/ — clean
- helm template smoke render — succeeds
- python3 yaml.safe_load on api/openapi.yaml — parses
- OpenAPI Agent schema scan: no api_key_hash property
- CI guardrail mirror: clean on real tree, fires on all 4 synthetic
  regression patterns
- Domain pkg coverage: Agent.MarshalJSON 100%, connector.go total 87.5%
- Handler pkg coverage: 79.2%

Sample response body (httptest captured during verification, GET
/api/v1/agents/{id} via the new handler test):

  {"id":"agent-demo","name":"demo-agent","hostname":"demo.host",
  "status":"Online","last_heartbeat_at":"2026-04-24T11:59:30Z",
  "registered_at":"2026-04-24T12:00:00Z","os":"linux",
  "architecture":"amd64","ip_address":"10.0.0.42",
  "version":"v2.0.49"}

Note the absence of any api_key_hash key, even though the in-memory
struct passed to the handler had APIKeyHash set to a sentinel.

Out of scope (intentionally untouched):
- internal/repository/postgres/agent.go SELECT/INSERT/UPDATE/scan
  paths and GetByAPIKey lookup — DB column stays, repo still
  populates the struct, auth lookup still works. The redaction is a
  marshal-boundary concern.
- migrations/000001_initial_schema.up.sql + migrations/seed_*.sql —
  DB schema and seed data unchanged.
- internal/service/agent.go::RegisterAgent — service-side hashing
  and persistence unchanged.
- Other domain types with potential credential-derivative fields
  (Issuer.Config, DeploymentTarget.Config, notifier configs). Not
  flagged by the audit; some are already protected (e.g.,
  DeploymentTarget.EncryptedConfig []byte `json:"-"`). File a
  separate audit pass if recon surfaces additional leaks.
- Per-resource DTO layer across every handler. Single audit
  finding, single domain type.
- A separate possible follow-up: the v2 RegisterAgent endpoint
  doesn't return the plaintext API key to the agent, which may
  mean self-bootstrap via POST /api/v1/agents is broken. Verified
  during recon; out of scope for G-2; should be its own ticket.

Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
      §2 P1 cluster, cat-s5-apikey_leak
      Audit recommendation: 'json:"-" or API-response DTO
      excluding APIKeyHash' — went with the json:"-" + MarshalJSON
      defense-in-depth pair plus CI guardrail and structural docs.
2026-04-25 01:56:26 +00:00
shankar0123 697fa792ea Merge branch 'fix/g1-jwt-silent-auth-downgrade-removal' 2026-04-25 00:22:33 +00:00
shankar0123 9c1d446e40 fix(security,config): remove unimplemented JWT auth-type, close silent downgrade (G-1)
The pre-G-1 config validator accepted CERTCTL_AUTH_TYPE=jwt and the
startup log faithfully echoed 'authentication enabled type=jwt'.
Reasonable people read that and concluded JWT auth was on. It wasn't.
The auth-middleware wiring at cmd/server/main.go unconditionally routed
every request through the api-key bearer middleware regardless of
cfg.Auth.Type. So CERTCTL_AUTH_TYPE=jwt quietly compared the incoming
'Authorization: Bearer <token>' against whatever string the operator put
in CERTCTL_AUTH_SECRET — real JWT clients got 401, and operators who
treated CERTCTL_AUTH_SECRET as a *signing* secret (because they thought
they were configuring JWT) had effectively handed an attacker an api-key.
A security finding masquerading as a config option.

We chose the audit-recommended structural fix: remove the option, fail
fast at startup, and add the gateway-fronting pattern as the documented
forward path. Implementing JWT middleware would have meant jwks vs
static-secret rotation, claim mapping, expiry enforcement, audience and
issuer validation, key rollover semantics, and regression coverage at the
same depth as the existing api-key path — a feature, not a fix. Operators
who genuinely need JWT/OIDC front certctl with an authenticating gateway
(oauth2-proxy / Envoy ext_authz / Traefik ForwardAuth / Pomerium /
Authelia) and run the upstream certctl with CERTCTL_AUTH_TYPE=none. Same
shape works on docker-compose and Helm.

The change is comprehensive across 7 phases — every surface that
mentioned 'jwt' as a certctl-auth-type is updated, plus structural
backstops (typed enum, runtime guard, helm template validation, CI grep
guard) so the lie can't reappear.

Files changed:

Phase 1 — production code (typed enum + jwt removal):
- internal/config/config.go: AuthType typed alias + AuthTypeAPIKey /
  AuthTypeNone constants + ValidAuthTypes() helper. Validate() routes
  literal 'jwt' through a dedicated multi-line diagnostic naming the
  authenticating-gateway pattern, then cross-checks against
  ValidAuthTypes(). Secret-required branch simplified to api-key-only.
  Field comment on AuthConfig.Type rewritten to drop jwt and point at
  the gateway pattern.
- internal/api/middleware/middleware.go: AuthConfig.Type field comment
  references the typed config.AuthType constants.
- internal/api/handler/health.go: same treatment for HealthHandler.AuthType.
- cmd/server/main.go: defense-in-depth runtime switch immediately after
  config.Load() — exits 1 on any unsupported auth-type that bypassed the
  validator. Auth-disabled startup log explicitly names the
  authenticating-gateway pattern.

Phase 2 — tests (Red→Green, contract pinning):
- internal/config/config_test.go: TestValidate_JWTAuth_RejectedDedicated
  (two table rows pinning the dedicated G-1 error fires regardless of
  whether Secret is set), TestValidAuthTypesDoesNotContainJWT (property
  guard against future re-introduction),
  TestValidAuthTypesIsExactly_APIKey_None (allowed-set contract),
  TestValidate_GenericInvalidAuthType (pins non-jwt invalid values still
  hit the generic invalid-auth-type error). Removed the prior
  TestValidate_JWTAuth_MissingSecret happy-path since its premise is
  inverted post-G-1.
- internal/api/handler/health_test.go: removed
  TestAuthInfo_ReturnsAuthType_JWT (which baked the silent-downgrade lie
  into the regression suite). Pre-existing _APIKey test continues to
  cover the api-key happy path.

Phase 3 — spec, docs, env templates:
- api/openapi.yaml: auth_type enum dropped to [api-key, none] with
  inline comment naming the G-1 closure.
- .env.example (root): CERTCTL_AUTH_TYPE comment block rewritten to drop
  jwt and point at the gateway pattern; secret-required conditional
  simplified to api-key-only.
- docs/architecture.md: middleware-stack bullet rewritten to drop the
  JWT mention; new H3 'Authenticating-gateway pattern (JWT, OIDC, mTLS)'
  section explaining the design rationale and listing oauth2-proxy /
  Envoy ext_authz / Traefik ForwardAuth / Pomerium / Authelia / Caddy
  forward_auth / Apache mod_auth_openidc / nginx auth_request as the
  standard fronting options.
- docs/upgrade-to-v2-jwt-removal.md (new ~125 lines): migration guide
  with preconditions, what-changes, both recovery paths, complete
  docker-compose oauth2-proxy walkthrough, Traefik ForwardAuth and Envoy
  ext_authz patterns, rollback posture.

Phase 4 — Helm chart (template validation + docs):
- deploy/helm/certctl/templates/_helpers.tpl: new certctl.validateAuthType
  helper mirroring the existing certctl.tls.required pattern. Fails
  template render on any server.auth.type outside {api-key, none} with
  a multi-line diagnostic.
- deploy/helm/certctl/templates/server-deployment.yaml,
  server-configmap.yaml, server-secret.yaml: invoke the helper at the
  top of each template that depends on .Values.server.auth.type.
- deploy/helm/certctl/values.yaml: auth: block comment expanded with the
  G-1 rationale and gateway-pattern cross-reference.
- deploy/helm/CHART_SUMMARY.md: server.auth.type table row now surfaces
  the allowed set and points at the upgrade doc.
- deploy/helm/certctl/README.md: new 'JWT / OIDC via authenticating
  gateway' section with a Kubernetes-flavored oauth2-proxy + certctl
  walkthrough.

Phase 5 — release surface:
- CHANGELOG.md: new [unreleased] top entry with Breaking / Removed /
  Added / Changed sections; explicit pointer at
  docs/upgrade-to-v2-jwt-removal.md from the Breaking subsection.

Phase 6 — CI guardrail:
- .github/workflows/ci.yml: new 'Forbidden auth-type literal regression
  guard (G-1)' step. Scoped patterns catch the actual regression shapes
  (map literal, slice literal, switch case, OpenAPI enum, env-file
  default, AuthType('jwt') cast). Comments and the dedicated rejection
  branch are intentionally exempt; connector-package JWT references
  (Google OAuth2 / step-ca) are exempt as out-of-scope external
  protocols. Verified locally: the guard passes on the actual tree and
  fires on all 4 synthetic regression patterns.

Out of scope (explicitly untouched):
- internal/connector/discovery/gcpsm/gcpsm.go — Google OAuth2 service-
  account JWT (external protocol).
- internal/connector/issuer/googlecas/googlecas.go — same.
- internal/connector/issuer/stepca/stepca.go — step-ca's provisioner
  one-time-token JWT for /sign API.
- docs/test-env.md, docs/connectors.md, docs/features.md — describe
  external CAs' use of JWT, not certctl's auth shape.
- Implementing actual JWT middleware. Feature, not a fix.

Verification (all gates pass):
- go build ./... — clean
- go vet ./... — clean
- go test -short ./... — every package green
- go test -short -race ./internal/config/... ./internal/api/... — clean
- govulncheck ./... — no vulnerabilities in our code
- helm lint deploy/helm/certctl/ — clean
- helm template with auth.type=api-key — renders OK
- helm template with auth.type=none — renders OK
- helm template with auth.type=jwt — fails with validateAuthType
  diagnostic (exit 1)
- python3 yaml.safe_load on api/openapi.yaml — parses
- CI guardrail mirror — clean on real tree, fires on all 4 synthetic
  regression patterns
- Smoke test: 'CERTCTL_AUTH_TYPE=jwt ./certctl-server' exits non-zero
  with: 'Failed to load configuration: CERTCTL_AUTH_TYPE=jwt is no
  longer accepted (G-1 silent auth downgrade): no JWT middleware ships
  with certctl. To use JWT/OIDC, run an authenticating gateway
  (oauth2-proxy / Envoy ext_authz / Traefik ForwardAuth / Pomerium) in
  front of certctl and set CERTCTL_AUTH_TYPE=none on the upstream.
  See docs/architecture.md "Authenticating-gateway pattern" and
  docs/upgrade-to-v2-jwt-removal.md for the migration walkthrough'

config pkg coverage: ValidAuthTypes 100%, Validate 94.7%, total 75.5%.

Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
      §2 P1 cluster, cat-g-jwt_silent_auth_downgrade
      Audit recommendation followed verbatim: 'Remove jwt from
      validAuthTypes until middleware ships'.
2026-04-25 00:22:23 +00:00
shankar0123 3192cd15c5 Merge branch 'fix/u1-followups-helm-rootenv-examples' 2026-04-24 23:51:18 +00:00
shankar0123 af47d19ae2 fix(deploy,examples,env): close U-1 trap end-to-end across Helm, examples, and root env
Follow-up to cfc234e (U-1 docker-compose fix) — closes the remaining adjacent
code paths that share the postgres-first-boot-password-binding root cause but
were scoped out of the original commit.

The runtime diagnostic in internal/repository/postgres/db.go::wrapPingError
(landed in a911970) already covers every NewDB call site, so Helm operators
and example users hit the SQLSTATE 28P01 guidance for free at startup. What
was missing: deployment-shape-specific remediation guidance (kubectl vs
docker-compose), the hardcoded password in the *root* .env.example, and
shared ops notes for the 5 examples/ compose files. This commit closes all
three.

Files changed:

- .env.example (root) — line 16 had `postgres://certctl:certctl@...` with
  the password hardcoded literally instead of interpolating POSTGRES_PASSWORD.
  Edit if a user copied this file as their .env (binary-direct deployment,
  not docker-compose) and rotated POSTGRES_PASSWORD on line 10, the URL on
  line 16 still carried 'certctl' — silent two-line drift. Replaced 'certctl'
  with the same default that line 10 carries ('change-me-in-production') and
  added an explanatory comment block describing the docker-compose
  override semantics, when this URL matters (binary-direct), and the
  cross-reference to the U-1 wrapPingError diagnostic. Also fixed an
  adjacent bug: line 31 CERTCTL_SERVER_URL was `http://localhost:8443`,
  which agents reject at startup since v2.2 (HTTPS-everywhere milestone made
  the control plane HTTPS-only with TLS 1.3 pinned). Updated to https://
  with a comment pointing operators at the bootstrap CA bundle.

- deploy/helm/certctl/values.yaml — postgresql.auth.password field had a
  one-line 'REQUIRED' comment. Expanded into a full WARNING block (~25
  lines) explaining the PVC retention semantics, the failure symptom,
  and both kubectl-flavored remediation paths: non-destructive
  (`kubectl exec ... ALTER ROLE`) preferred for environments with data,
  and destructive (`helm uninstall + kubectl delete pvc`) for dev/demo.
  Cross-references the wrapPingError runtime diagnostic.

- deploy/helm/certctl/README.md (new, ~115 lines) — chart-level operational
  guide. Covers quick install, both remediation paths with concrete
  kubectl commands, why-we-don't-fix-this-in-the-chart explanation,
  cross-references to the docker-compose docs, server API key rotation
  (the easy case — comma-separated key list), TLS provisioning shapes,
  embedded-vs-external postgres, and uninstall semantics with the PVC
  retention gotcha called out.

- examples/README.md (new, ~55 lines) — shared operational notes for the
  5 example deployments. Covers the postgres password rotation trap with
  example-flavored remediation paths (`docker compose -f examples/<x>/...`),
  the TLS warning, and teardown semantics. Replaces what would otherwise
  be 5x duplication across per-example READMEs.

- examples/{acme-nginx,acme-wildcard-dns01,multi-issuer,private-ca-traefik,
  step-ca-haproxy}/*.md — one-line cross-reference at the top of each
  example's primary doc, pointing at examples/README.md for the shared
  ops notes. Avoids 5x duplication of the same warning text while still
  surfacing the link in every operator's first-touch surface.

Verification:

- go build ./... — clean
- go vet ./... — clean
- go test -short ./internal/repository/postgres/ — 4/4 wrapPingError tests
  still passing (no production-code touch in this commit)
- helm lint deploy/helm/certctl/ — clean (1 INFO about chart icon, pre-existing)
- helm template smoke test — renders without error
- python3 yaml.safe_load on values.yaml — parses

Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
      §2 P1 cluster, cat-u-quickstart_postgres_password_volume_trap
      Closes the three deliberate scope-outs from cfc234e (Helm,
      root .env.example, examples/) end-to-end.

      Adjacent bugs caught while in scope:
      - root .env.example:16 hardcoded password not matching line 10
      - root .env.example:31 http:// URL incompatible with HTTPS-only v2.2
2026-04-24 23:51:13 +00:00
shankar0123 cfc234ec42 Merge branch 'fix/u1-postgres-password-volume-trap-diagnostic' 2026-04-24 23:21:33 +00:00
shankar0123 a91197014f fix(db): emit volume-state guidance on postgres auth failure (U-1, #10)
The shipped quickstart instructs operators to copy deploy/.env.example to
deploy/.env, edit POSTGRES_PASSWORD, and run docker compose up. On the
*first* boot of a fresh checkout this works. On the *second* boot — i.e.,
when an operator first booted with the default POSTGRES_PASSWORD=certctl,
then edited .env and re-ran up — the certctl-server container picks up the
new password (env interpolated at every container start) but postgres does
not. The postgres docker-entrypoint runs initdb only when the data dir is
empty; on subsequent boots the persistent named volume postgres_data is
non-empty so pg_authid retains the password baked in on first boot. The
server connects with the new credentials, postgres rejects them, and the
operator sees an opaque `pq: password authentication failed for user
"certctl"` in the server log with no pointer to the actual cause. New-
operator onboarding gets blocked on the documented production path.

Why a doc fix alone is not sufficient. Operators don't reread the docs
after a successful first boot — the trap fires on the *second* up, when
they think they've already learned the system. The opaque pq error is
indistinguishable in the log from a typo'd password or a misconfigured
secret store. The diagnostic has to fire at the moment the failure is
observed.

Why we don't try to fix the bootstrap. The env-vs-pg_authid divergence is
intrinsic to how the official postgres image bootstraps (see
docker-entrypoint.sh: initdb runs only if PGDATA is empty). Switching to a
bind mount or ephemeral volume breaks the production path; switching to
POSTGRES_PASSWORD_FILE + ALTER ROLE adds operator surface without
eliminating the divergence. The ergonomic fix is to surface the failure
mode loudly, with both remediation paths, at the exact log line where it
becomes visible.

Two remediation paths, surfaced together. Destructive: `docker compose
-f deploy/docker-compose.yml down -v && up -d --build` — wipes the
postgres volume so initdb re-runs with the new env value. Use this on
demos / first-time setup where data loss is acceptable. Non-destructive:
`docker compose exec postgres psql -U certctl -c "ALTER ROLE certctl
PASSWORD '<new>';"` followed by a server restart with the matching
POSTGRES_PASSWORD. Use this on any environment that holds data you want
to keep. Surfacing both means the operator can pick based on their
environment without us assuming.

Files changed:

- internal/repository/postgres/db.go — extract wrapPingError(err) helper.
  errors.As against *pq.Error; on SQLSTATE 28P01 (invalid_password) emit
  the multi-line guidance preserving the %w wrap chain. Non-28P01 errors
  retain the original `failed to ping database: %w` shape so transient
  connection-refused / timeout paths don't get noisy. Add
  pgErrInvalidPassword = "28P01" constant. Convert blank
  `_ "github.com/lib/pq"` import to direct import (driver registration
  still works via init()) so we can name the *pq.Error type at compile
  time. NewDB now calls wrapPingError(err) instead of inlining the wrap.
- internal/repository/postgres/db_test.go (new) — 4 internal-package
  unit tests covering wrapPingError. AuthFailureGuidance pins the
  contract substrings ("SQLSTATE 28P01", "POSTGRES_PASSWORD",
  "first boot", "down -v", "ALTER ROLE"). NonAuthErrorPreservesOriginalWrap
  pins the no-leak contract for SQLSTATE 08006 (connection_failure).
  NonPqErrorPreservesOriginalWrap pins the network-level path.
  NilReturnsNil pins defensive contract. All run in -short without
  testcontainers — package postgres (internal) so the unexported helper
  is callable directly.
- docs/quickstart.md — `> **Warning:**` callout immediately after the
  `cp deploy/.env.example deploy/.env` block at lines 56-61. Names the
  trap, names the SQLSTATE, gives both remediation paths. Uses the
  in-file `> **Note:**` blockquote convention.
- deploy/ENVIRONMENTS.md — `**Stateful volume — first-boot password
  binding (U-1)**` paragraph appended to the Postgres expert-note block.
  Explains the env-vs-pg_authid divergence, points at wrapPingError as
  the runtime diagnostic, lists both remediation paths. Uses the in-file
  `**Expert note:**` convention.

Out of scope (separate follow-ups):

- deploy/helm/certctl/templates/postgres-statefulset.yaml has the same
  root cause via PVC retention. The wrapPingError diagnostic covers the
  Helm path because the same NewDB code runs at server startup; the
  Helm-specific doc warning lands separately.
- /.env.example at repo root (line 16 hardcodes the password literally
  inside CERTCTL_DATABASE_URL rather than interpolating) — adjacent
  trap, separate fix.
- examples/{acme-nginx,private-ca-traefik,step-ca-haproxy,multi-issuer,
  acme-wildcard-dns01}/docker-compose.yml all carry the pattern. The
  diagnostic covers them; targeted doc warnings are scoped to the
  canonical quickstart + ENVIRONMENTS docs.

Out of consideration:

- Switch to bind mount / ephemeral volume — breaks the production path.
- POSTGRES_PASSWORD_FILE + Docker secret + ALTER ROLE rotation — adds
  operator surface without fixing the env-vs-pg_authid divergence.

Verification (all passing):
- go build ./...
- go vet ./...
- go test -short -race ./internal/repository/postgres/ — 4/4 new tests
  pass plus existing tests
- go test -short ./... — every package green
- govulncheck ./... — no vulnerabilities in our code
- wrapPingError coverage 100%; postgres pkg total unchanged in shape
  (NewDB/RunMigrations were 0% pre-fix, still 0% post-fix; new helper
  adds 100%-covered statements)

Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
      §2 P1 cluster, cat-u-quickstart_postgres_password_volume_trap
      GitHub Issue #10 (mikeakasully)
2026-04-24 23:21:26 +00:00
shankar0123 d6959a75c1 Merge branch 'test/l1-repo-integration-coverage' v2.0.49 2026-04-20 20:39:10 +00:00
shankar0123 97b23e98d9 test(repository): close L-1 integration-coverage gap for HealthCheck + RenewalPolicy
The coverage-gap audit flagged L-1 (P2): `HealthCheckRepository` (453 LOC,
11 methods) and `RenewalPolicyRepository` (289 LOC, 5 methods post-G-1 —
the audit's "92 lines, 2 methods" figure was stale) ship to production
with zero live-DB integration coverage. The existing `repo_test.go`
header self-documents the gap: "15 of 17 PostgreSQL repository files".

Operationally load-bearing piece: M48's scheduler calls
`HealthCheckRepository.ListDueForCheck` every tick to drive continuous
TLS health monitoring. A silent SQL regression there — wrong INTERVAL
math, NULL-handling slip, lost ORDER BY — would fail open: operator
adds endpoint → scheduler never picks it up → endpoint degrades in
production → no alert. The loop continues ticking and logs "processed
0 endpoints" normally, so the failure mode is operationally invisible.

Closure shape (test-only; no production code touched):

- internal/repository/postgres/health_check_test.go (new file, 7 tests)
  · TestHealthCheckRepository_CRUD
  · TestHealthCheckRepository_GetByEndpoint
  · TestHealthCheckRepository_List_Filters
  · TestHealthCheckRepository_ListDueForCheck  (the load-bearing one —
    seeds four rows with differing last_checked_at+interval
    relationships to NOW() plus one NULL-last_checked_at row,
    asserts the correct subset returns and ORDER BY last_checked_at
    ASC NULLS FIRST holds)
  · TestHealthCheckRepository_RecordHistory_GetHistory
  · TestHealthCheckRepository_PurgeHistory
  · TestHealthCheckRepository_GetSummary

- internal/repository/postgres/renewal_policy_test.go (new file, 3 tests)
  · TestRenewalPolicyRepository_CRUD  (exercises auto-generated
    rp-<slug(name)> PK, JSONB round-trip of [30,14,7,0] thresholds,
    UpdatedAt monotonic advance, ORDER BY name for List)
  · TestRenewalPolicyRepository_DuplicateName  (asserts
    errors.Is(err, repository.ErrRenewalPolicyDuplicateName) on both
    Create-name-unique and Update-name-unique collision paths, the pg
    23505 sentinel mapping)
  · TestRenewalPolicyRepository_DeleteInUse  (raw-INSERTs a
    managed_certificates row FK'ing the policy, asserts
    errors.Is(err, repository.ErrRenewalPolicyInUse) from pg 23503
    ON DELETE RESTRICT, cleans up, then asserts not-found surfaces
    distinctly)

- internal/repository/postgres/repo_test.go (one-line header flip)
  "covering 15 of 17 ... repository files" → "17 of 17"; added
  cross-reference pointing readers at the two sibling files.

Both new files use the existing getTestDB(t) + schema-per-test-isolation
convention and skip via testing.Short() in CI, matching M26 TICKET-003
scaffolding byte-for-byte. Repository/postgres is not in the CI
coverage-gate path (grep -nE "internal/repository/postgres"
.github/workflows/ci.yml → no hits), so adding test-only files cannot
regress gated coverage elsewhere.

Verification gates run locally (sandbox without Docker, so the -short
skip gate itself is what's exercised; operator runs the testcontainer
path locally):

  1.  go vet ./...                                              — clean
  2.  go build ./...                                            — clean
  3.  go test -short -count=1 ./...                             — clean
  4.  go test -race -short ./internal/repository/postgres/...   — clean
  5.  staticcheck                         — absent; CI checkset holds
  6.  govulncheck                         — skipped; test-only, no deps
  7.  per-layer coverage no-regression    — N/A; repo/pg not gated
  8.  tsc --noEmit                        — N/A; no frontend change
  9.  vitest run                          — N/A; no frontend change
  10. vite build                          — N/A; no frontend change
  11. OpenAPI lint                        — N/A; no spec change

No migration, no interface change, no production code diff. The
RenewalPolicyRepository drift between audit ("92 lines, 2 methods")
and HEAD (289 lines, 5 methods post-G-1) is documented honestly in
the audit report's Resolution Log, not papered over.

Closes: coverage-gap-audit L-1 (P2)
2026-04-20 20:39:06 +00:00
shankar0123 4cf5fcdb4f Merge branch 'fix/d1-cli-status-endpoint' 2026-04-20 19:41:03 +00:00
shankar0123 1ee67b7792 D-1: correct certctl-cli status endpoint path (/api/v1/health -> /health)
The CLI's GetStatus() was issuing GET /api/v1/health, but the real
liveness route is GET /health at internal/api/router/router.go:76
(mounted at root, not under /api/v1/). Every 'certctl-cli status'
invocation 404'd since M16b.

The regression was masked because TestClient_GetStatus encoded the
same wrong path on both sides of the contract -- the mock server
also dispatched on /api/v1/health -- so the production request
matched the test's buggy dispatch and the green bar hid the bug.

Two-line fix:
  - internal/cli/client.go:615: "/api/v1/health" -> "/health"
  - internal/cli/client_test.go:296: mock dispatch to match

Red receipt captured before the green fix: with the test fixture
corrected but production still wrong, TestClient_GetStatus fails
'parsing response: unexpected end of JSON input' (the client falls
through the mock's if/else to the default 200 OK empty body and
the JSON decoder chokes). After the production edit the test
passes.

GetStatus()'s response decoder is already compatible with the real
/health shape (graceful 'ok' check on health["status"], optional
health["timestamp"]). No interface change. No migration. No
frontend change. No OpenAPI delta -- /health is a root-level
liveness probe, not part of the /api/v1/ surface.
2026-04-20 19:40:58 +00:00
shankar0123 128d0eeaa8 Merge branch 'fix/g1-renewal-policies-api'
G-1: renewal-policies API + frontend FK-drift fix. Adds /api/v1/renewal-policies
CRUD backing the dropdown that managed_certificates.renewal_policy_id FKs into.
Three frontend call sites swapped from getPolicies() (pol-*, compliance rules)
to getRenewalPolicies() (rp-*, lifecycle policies). Validation bounds, pg
23503/23505 error mapping to HTTP 409, OpenAPI coverage, test suite.

No migration — renewal_policies table already exists from schema 000001.
2026-04-20 18:53:09 +00:00
shankar0123 9834b4e4a4 G-1: renewal-policies API + frontend FK-drift fix
Three frontend call sites (OnboardingWizard.tsx:603, CertificatesPage.tsx:52,
CertificateDetailPage.tsx:169) populated the renewal_policy_id dropdown from
getPolicies() — the compliance-rule endpoint returning pol-* IDs — which
violated the FK managed_certificates.renewal_policy_id REFERENCES
renewal_policies(id) ON DELETE RESTRICT. Create would fail pg 23503 at insert.

Backend (new):
- RenewalPolicyRepository CRUD + ListAll/ExistsByID (pg 23503 → ErrRenewalPolicyInUse
  → HTTP 409; pg 23505 → ErrRenewalPolicyDuplicateName → HTTP 409)
- RenewalPolicyService with repo-only constructor. Service sentinels
  var-alias the repo sentinels so errors.Is walks across layers.
- RenewalPolicyHandler with validation bounds: name 1–255;
  renewal_window_days [1,365] default 30; max_retries [0,10] not defaulted;
  retry_interval_seconds [60,86400] default 3600; alert_thresholds_days
  [0,365] default [30,14,7,0]. Auto-generated IDs rp-<slug(name)>.
- Router registers 5 routes under /api/v1/renewal-policies[/{id}].

Frontend:
- CertificatesPage/CertificateDetailPage/OnboardingWizard now call
  getRenewalPolicies() and render rp-* IDs.
- client.ts adds getRenewalPolicies/createRenewalPolicy/updateRenewalPolicy/
  deleteRenewalPolicy. types.ts adds the RenewalPolicy shape.

OpenAPI: RenewalPolicies tag + 5 operations + 3 schemas (RenewalPolicy,
RenewalPolicyCreateRequest, RenewalPolicyUpdateRequest). 409 responses
on create/update duplicate-name and delete FK-in-use.

No migration — renewal_policies table already exists from the initial
schema (000001).

Tests:
- internal/service/renewal_policy_test.go: CRUD + validation + sentinel
  error wrapping.
- internal/api/handler/renewal_policy_handler_test.go: handler endpoint
  contracts including 400/404/409.
- web/src/api/client.test.ts: 4 subtests covering the 4 new API functions.

Phase 3 gates all green: go vet, build, short tests, race tests (service/
handler/router/scheduler), staticcheck (G-1 packages), govulncheck (0
reachable), coverage (service 69.7%, handler 79.0%, domain 86.9%,
middleware 80.6% — all above thresholds), tsc, vitest (256 passed),
vite build, OpenAPI structural validation.
2026-04-20 18:53:01 +00:00
shankar0123 cab579368b Merge branch 'fix/audit-f001-f002-f003'
Closes F-001 (CRL scoped query via composite index), F-002 (digest error
body sanitization), and F-003 (ctx-aware sleep at three sites).

Verification: build, vet, race-short test sweep across all packages green.
govulncheck clean. golangci-lint run deferred — local environment's
golangci-lint is v1.64.8 built with go1.24 and rejects the go1.25.9
project; fresh install blocked by disk constraints. CI lane will cover it.
2026-04-20 16:52:00 +00:00
shankar0123 4e5522a999 F-001/F-002/F-003: CRL prefix-scan, digest error sanitization, ctx-aware sleeps
F-001 (P3): GenerateDERCRL scoped to issuer via composite index
  - Add RevocationRepository.ListByIssuer leveraging migration 000012's
    idx_certificate_revocations_issuer_serial composite index as a
    prefix-scan target. Previously CAOperationsSvc.GenerateDERCRL called
    ListAll() and filtered by IssuerID in Go — O(total revocations)
    regardless of how many revocations belonged to the target issuer.
  - Rewrite GenerateDERCRL to call ListByIssuer(ctx, issuerID) so PostgreSQL
    drives a prefix scan of the composite index. Drops the in-memory filter.
  - New regression test in ca_operations_test.go asserts the CRL hot path
    invokes ListByIssuer exactly once and never ListAll, and that the
    issuerID is threaded through correctly.

F-002 (P3): digest.go admin-auth endpoints no longer leak internal errors
  - PreviewDigest (GET /api/v1/digest/preview) and SendDigest
    (POST /api/v1/digest/send) previously wrote err.Error() into the HTTP
    response body on 500s. Replace with slog.Error server-side logging plus
    a generic "internal error" response body, matching the house pattern
    in certificates.go and export.go.

F-003 (P4): three blocking time.Sleep sites now honor ctx cancellation
  - internal/connector/issuer/acme/acme.go:672 (DNS-01 propagation wait)
    now runs under a select{case <-ctx.Done(): CleanUp + return ctx.Err();
    case <-time.After(d):} so graceful shutdown doesn't get stuck behind
    the propagation delay.
  - internal/connector/issuer/acme/acme.go:786 (dns-persist-01 propagation
    wait) same pattern, returns ctx.Err() on cancel.
  - cmd/agent/main.go:272 (polling backoff inside the heartbeat loop) now
    wraps the sleep in select{case <-ctx.Done(): continue; case <-time.After(backoff):}
    so the outer <-ctx.Done() case on the parent loop fires cleanly.

Verification: build, vet, and race-enabled short tests green across all
55+ packages. govulncheck reports zero vulnerabilities in the code path.
No migration needed — F-001 reuses the existing 000012 composite index.
No frontend changes.
2026-04-20 16:51:52 +00:00
shankar0123 55ce86b132 v2.0.48: swap self-signed TLS bootstrap algorithm ed25519 → ECDSA-P256
Follow-up to v2.0.47 (HTTPS-Everywhere). The Phase-3 self-signed
bootstrap sidecar shipped an ed25519 server cert. Apple's TLS stack —
Safari Network Framework and the macOS-bundled LibreSSL 3.3.6
/usr/bin/curl — does not advertise ed25519 in the ClientHello
signature_algorithms extension for server certs, so the handshake fails
with the server-side log line:

  tls: peer doesn't support any of the certificate's signature algorithms

Homebrew OpenSSL 3.x, Chrome, Firefox, and Linux curl all accept
ed25519 server certs fine. Apple is the outlier. Rather than gate the
demo stack behind "install Homebrew OpenSSL first," swap the bootstrap
algorithm to ECDSA-P256 with SHA-256 — universally supported, including
on the Apple stack.

Changes
- deploy/docker-compose.yml: certctl-tls-init openssl invocation swapped
  to `-newkey ec -pkeyopt ec_paramgen_curve:P-256 -nodes`; header comment
  + echo line updated; multi-line rationale paragraph added.
- deploy/docker-compose.test.yml: same openssl swap + echo update for
  the test harness sidecar that writes to the bind-mounted ./test/certs
  directory the Go integration_test.go pins via CERTCTL_TEST_CA_BUNDLE.
- docs/tls.md: Pattern 1 description + code block updated;
  "Why ECDSA-P256 and not ed25519" rationale paragraph added covering
  pre-v2.0.48 history, the Apple diagnosis, accepting clients, and
  the operator migration command. Patterns 2 (existing Secret) and 3
  (cert-manager) explicitly called out as unaffected.
- docs/upgrade-to-tls.md: docker-compose procedure sentence updated
  with cross-reference to tls.md Pattern 1.
- docs/test-env.md: "Get the CA bundle for curl" sentence updated.

Migration
Existing demo installs must tear the `certs` named volume down to pick
up the new algorithm:

  docker compose -f deploy/docker-compose.yml down -v
  docker compose -f deploy/docker-compose.yml up -d --build

Not touched
- cmd/server/tls.go: algorithm-agnostic. TLS 1.3 min version with
  [X25519, P-256] curve preferences for key exchange is orthogonal to
  the server cert's signature algorithm. No Go code change needed.
- Helm chart: Patterns 2 and 3 operators supply their own cert; this
  patch does not affect them.
- Unrelated ed25519 uses (agent key algorithm detection, profile
  algorithm options, SSH key path examples, tlsprobe key metadata,
  cloud discovery key-algo display): all orthogonal to the server TLS
  bootstrap cert.

Incidental cleanup
- .gitignore: dropped dangling `strategy.md` entry (file doesn't exist
  in repo; entry was cruft).
v2.0.48
2026-04-20 04:17:05 +00:00
shankar0123 52248be717 v2.0.47: HTTPS Everywhere — TLS-only control plane, agents/CLI/MCP
Breaking change release. Plaintext HTTP listener removed. The certctl
control plane now terminates TLS 1.3 on :8443 via
http.Server.ListenAndServeTLS. No CERTCTL_TLS_ENABLED=false escape
hatch. No dual-listener mode. One-step cutover per docs/upgrade-to-tls.md.

Server
- cmd/server/tls.go: certHolder with SIGHUP hot-reload + atomic cert
  swap, buildServerTLSConfig (TLS 1.3 min, GetCertificate callback),
  preflightServerTLS validation
- cmd/server/main.go: ListenAndServeTLS in place of ListenAndServe,
  watchSIGHUP wiring, cert/key path config threading
- tls_test.go: 418-line regression coverage of reload, preflight,
  callback behavior, SAN validation

Config
- CERTCTL_TLS_CERT_PATH / CERTCTL_TLS_KEY_PATH (required)
- Plaintext rejection: agents/CLI/MCP pre-flight-fail on http://
  URLs with a pointer to docs/upgrade-to-tls.md

Agents, CLI, MCP
- All three pre-flight-reject http:// URLs with fail-loud diagnostic
- CERTCTL_SERVER_CA_BUNDLE_PATH for private-CA trust
- CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY for dev-only bypass
  (loud warning on startup)
- install-agent.sh emits both vars as commented template lines

docker-compose
- certctl-tls-init sidecar generates SAN-valid self-signed cert into
  deploy/test/certs/ on first boot
- All demo-stack curls pin against ca.crt with --cacert

Helm chart
- Three TLS provisioning modes, exactly one required:
  - server.tls.existingSecret (operator-supplied)
  - server.tls.certManager.enabled (cert-manager integration)
  - server.tls.selfSigned.enabled (eval only — not for production)
- server-certificate.yaml template for cert-manager mode
- helm install without a TLS source fails at template render with
  a pointer to docs/tls.md

CI
- .github/workflows/ci.yml Helm Chart Validation step renders the
  chart in both existingSecret and cert-manager modes, plus an
  inverse guard-regression test that asserts helm template MUST
  refuse to render when no TLS source is configured. Previously
  the single `helm template` invocation hit the certctl.tls.required
  fail-loud guard and exit-1'd CI. Four invocations now: lint
  (existingSecret), template (existingSecret), template
  (cert-manager), template (no args — must fail).

Integration tests
- deploy/test/integration_test.go stands up the Compose stack over
  HTTPS, extracts the CA bundle, and exercises every certctl API
  over https://localhost:8443
- All 34 integration subtests green (per Phase 8 local CI-parity)

Documentation
- New: docs/tls.md (provisioning patterns, rotation, SIGHUP reload)
- New: docs/upgrade-to-tls.md (one-step cutover, no-downgrade
  warnings, fleet-roll sequencing)
- CHANGELOG.md: v2.2.0 "HTTPS Everywhere — The Irony" entry
  (file heading unchanged; release tag is v2.0.47)
- All curls in docs/, examples/, deploy/helm/ guides use
  https://localhost:8443 --cacert

Verification
- grep -rn "ListenAndServe[^T]" cmd/ internal/ → 0 hits
- grep -rn "\"http://" cmd/ internal/ → 2 benign hits (Caddy admin
  API default, SSRF doc comment) — zero certctl endpoints
- Tasks #197–#206 (Phases 0–8) all closed in the tracker

Files: 65 changed, 3489 insertions, 372 deletions (pre-CI-fix).
v2.0.47
2026-04-20 03:43:10 +00:00
shankar0123 04c7eca615 docs: reconcile scheduler topology across sibling docs (7 → 12 loops)
Authoritative 12-loop table lives at docs/architecture.md:522-534 (committed via
the I-001/I-003/I-005 + M48/M50 milestone commits). This change brings six sibling
docs into parity with that table so every surface — user-facing features reference,
SOC 2 compliance mapping, connectors guide, advanced demo architecture diagram,
testing guide, and in-line architecture prose — reflects the same 8 always-on + 4
opt-in topology.

Touches:
- docs/architecture.md: 2 inline ordinal references (9th / 8th loop) replaced with
  descriptive names (opt-in cloud discovery / opt-in endpoint health), cross-linked
  to the authoritative table to prevent future ordinal rot.
- docs/features.md: metric row (7 → 12), inline reference to 9th loop, and full
  scheduler table expanded to include Always-on column + env vars + I-001/I-003/I-005
  refs.
- docs/compliance-soc2.md: background scheduler monitoring bullets expanded to list
  all 12 loops with env vars + I-series refs; table row updated with 8 always-on +
  4 opt-in summary.
- docs/connectors.md: three inline ordinals (7th/6th/9th loop) replaced with
  descriptive names, cross-linked to architecture.md.
- docs/demo-advanced.md: Mermaid SCHED node label updated from '7 background loops'
  to '12 background loops (8 always-on + 4 opt-in)'.
- docs/testing-guide.md: Test 20.1.1 header + grep pattern expanded to include
  job-retry / job-timeout / notification-retry / digest / endpoint-health /
  cloud-discovery loops; sign-off chart row label updated.

Pure documentation reconciliation. No code changes. Master HEAD pre-commit: 6e646e0.
2026-04-20 02:51:34 +00:00
shankar0123 6e646e0fe8 M-001/M-006: strip HTTP auth from EST/SCEP + fail-loud SCEP preflight
Closes CWE-306 (missing authentication for critical function) for SCEP
via a fail-loud startup gate, and aligns EST/SCEP HTTP dispatch with
their respective RFCs. CRL/OCSP remain unauthenticated under
.well-known/pki/* per RFC 5280 §5 / RFC 6960 / RFC 8615. Option (D):
no mTLS in this milestone.

- RFC 7030 §3.2.3 (EST auth is deployment-specific) and §4.1.1
  (/cacerts explicitly anonymous): EST paths served unauthenticated;
  CSR-signature + profile policy enforce identity inside ESTService.
- RFC 8894 §3.2: SCEP authenticates via the challengePassword
  PKCS#10 attribute (OID 1.2.840.113549.1.9.7), not an HTTP credential.
  HTTP dispatch is unauthenticated; preflightSCEPChallengePassword
  refuses to start when CERTCTL_SCEP_ENABLED=true without
  CERTCTL_SCEP_CHALLENGE_PASSWORD. SCEPService.PKCSReq enforces the
  same invariant defense-in-depth and compares with
  crypto/subtle.ConstantTimeCompare.

cmd/server/main.go:
- Extract buildFinalHandler(apiHandler, noAuthHandler, webDir,
  dashboardEnabled); route /.well-known/est/*, /scep, /scep/*,
  /.well-known/pki/crl/{id}, /.well-known/pki/ocsp/{id}/{serial},
  and health probes through noAuthHandler (RequestID +
  structuredLogger + Recovery only).
- Add preflightSCEPChallengePassword fail-loud gate; startup log
  emits challenge_password_set boolean for operator visibility.

cmd/server/finalhandler_test.go (new, 314 lines, 27 subtests):
- TestBuildFinalHandler_Dispatch (20) + TestBuildFinalHandler_NoDashboard
  (7) pin the dispatch surface: EST 4-endpoint, SCEP exact +
  trailing-slash + query-string, PKI CRL+OCSP, health, /api/v1/*
  authenticated, /assets/* file server, SPA fallback.

internal/api/router/router.go, internal/config/config.go:
- Router-level comments explain why EST/SCEP/PKI dispatchers sit
  outside the authenticated mux; SCEP challenge password config
  plumbed through.

docs/architecture.md:
- New EST Authentication subsection (RFC 7030 §3.2.3 + §4.1.1,
  buildFinalHandler + noAuthHandler references).
- Rewrite SCEP Authentication subsection; replaces pre-existing
  factually-incorrect "any value accepted" claim with CWE-306
  preflight, service-layer defense-in-depth, and
  crypto/subtle.ConstantTimeCompare.
- Top-level Authentication section: qualify /api/v1/* scope on API
  clients bullet; add standards-based-endpoints bullet referencing
  the 27-subtest regression harness.

docs/compliance-soc2.md:
- CC6.1: scope API Key Authentication to /api/v1/*; add
  standards-based endpoints bullet citing RFCs and CWE-306 closure.
- CC6.3: scope API Key Policy to /api/v1/* with cross-reference to
  CC6.1.
- Evidence Locations augmented with buildFinalHandler,
  preflightSCEPChallengePassword, scep.go defense path, regression
  harness, and OpenAPI security:[] overrides.

api/openapi.yaml: verified already correct (global bearerAuth
default overridden with security:[] on /cacerts, /simpleenroll,
/simplereenroll, /csrattrs, /scep GET+POST, /crl/{issuer_id},
/ocsp/{issuer_id}/{serial}); no edits needed.
2026-04-19 17:20:05 +00:00
shankar0123 675b87ba63 I-005: notification retry loop + dead-letter queue
Critical alerts can no longer be silently dropped by a transient
notifier failure. Failed notification attempts now ride an exponential
backoff retry loop, with a 5-attempt budget before promotion to the
dead-letter queue for operator intervention.

Schema (migration 000016, idempotent):
- retry_count INTEGER NOT NULL DEFAULT 0
- next_retry_at TIMESTAMPTZ
- last_error TEXT
- idx_notification_events_retry_sweep partial index
  (next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL
  Dead rows clear next_retry_at so the index stops matching them.

Service contract:
- NotificationService.RetryFailedNotifications drives 2^n-minute
  exponential backoff capped at 1h (notifRetryBackoffCap) with
  5-attempt budget (notifRetryMaxAttempts).
- Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to
  status='dead' via MarkAsDead.
- Non-terminal failures record via RecordFailedAttempt.
- Success path promotes to 'sent' without touching retry_count
  (audit preserves "delivered on attempt N").
- Missing-notifier branch defensively promotes to 'sent' to avoid
  wedging a row on a deleted channel.
- RequeueNotification operator escape hatch atomically resets
  retry_count -> 0, next_retry_at -> NULL, last_error -> NULL,
  status -> pending via notifRepo.Requeue.

Scheduler:
- New always-on notificationRetryLoop wired into the base loop set at
  CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m).
- sync/atomic.Bool idempotency guard.
- sync.WaitGroup shutdown drain via WaitForCompletion.

StatsService:
- SetNotifRepo setter pattern preserves 9 pre-existing
  NewStatsService call sites (main.go + stats_test.go + 8 digest
  tests) without touching the constructor signature.
- DashboardSummary.NotificationsDead populated via
  notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired
  (reports zero on systems without a notification repository).
- CountByStatus error is non-fatal (dashboard summary is
  best-effort for this field).
- Prometheus certctl_notification_dead_total counter emitted from
  the same snapshot.

Handler:
- New POST /api/v1/notifications/{id}/requeue endpoint.
- dead status surfaces to MCP + CLI.

Frontend:
- NotificationsPage gains two-tab toolbar ("All" / "Dead letter")
  with queryKey: ['notifications', activeTab] so switching tabs
  doesn't serve stale data until the 30s refetch.
- Dead rows surface "Retry {n}/5" + truncated last_error with
  full-text title tooltip.
- Requeue mutation wrapped as
    mutationFn: (id: string) => requeueNotification(id)
  to prevent react-query v5's positional context argument from
  leaking into the API client — pinned against future refactors
  by strict-match toHaveBeenCalledWith('notif-dead-001') in
  NotificationsPage.test.tsx:181.

Closes I-005.
2026-04-19 15:17:27 +00:00
shankar0123 707d8de4fb UX-001: sidebar re-entry + inline team/owner creation in wizard
Closes UX-001 (OnboardingWizard CertificateStep dead-end): users no
longer have to navigate away from the wizard and lose their in-flight
state when the required Owner/Team dropdowns are empty.

Layout.tsx
  - Adds persistent 'Setup guide' button in the left sidebar.
  - Clears localStorage 'certctl:onboarding-dismissed' then navigates
    to /?onboarding=1 as a re-entry signal that overrides dismissal.
  - localStorage.removeItem wrapped in try/catch to tolerate storage
    access errors (private browsing, quota, etc.).

DashboardPage.tsx
  - Reads ?onboarding=1 via useSearchParams as a forceOnboarding flag.
  - forceOnboarding bypasses the latched first-run gate so the wizard
    reopens even after dismissal or with certs/issuers already present.
  - onDismiss now also strips ?onboarding=1 via setSearchParams(next,
    { replace: true }) so a page refresh does not relaunch the wizard.

OnboardingWizard.tsx
  - Adds CreateTeamModalInline and CreateOwnerModalInline inside
    CertificateStep. Both wire through React Query: createTeam /
    createOwner mutation on success invalidates ['teams'] / ['owners']
    and calls onCreated(id) so the parent select auto-selects the new
    row as soon as the refetch lands.
  - '+ New team' and '+ New owner' buttons placed next to the select
    labels; empty-state copy replaced with inline 'create one now'
    buttons (no more Link back to /owners /teams).
  - CreateOwner coerces empty teamId to undefined before mutation so
    the server contract matches OwnersPage.

Tests (12 new, all green; total suite 252 passed / 0 failed):
  - Layout.test.tsx (4): Setup guide button renders, clicking it clears
    the dismissal key and navigates to /?onboarding=1, tolerates
    localStorage.removeItem throwing.
  - DashboardPage.test.tsx (4): first-run auto-open, ?onboarding=1
    re-entry after dismissal, onDismiss writes localStorage + strips
    the query param, dismissed-with-no-param stays closed.
  - OnboardingWizard.test.tsx (4): Skip-Skip reaches CertificateStep
    with '+ New team' / '+ New owner' buttons visible; '+ New team'
    happy path with React Query invalidation + parent-select
    auto-select via option-parent traversal (label is a sibling, not
    htmlFor-linked); '+ New owner' happy path pins team_id: undefined
    coercion; Cancel abort never mutates.

Test infrastructure notes:
  - Closure-driven vi.fn().mockImplementation pattern drives the
    post-invalidation refetch: the mutation mock mutates a closure
    variable that the getTeams/getOwners mock reads, so the parent
    select's new <option> exists by the time the refetch lands.
  - Anchored regex (/^Create Team$/, /^Create Owner$/) disambiguates
    the modal submit from the '+ New team' / '+ New owner' triggers.

Verification gates (all green):
  - vitest run: 252 passed / 0 failed (8 files, 13.98s)
  - tsc --noEmit: 0 errors
  - vite build: clean production bundle (851.77 kB js / 226.81 kB gzip)

No new runtime dependencies. Frontend-only change.
v2.0.46
2026-04-19 14:49:04 +00:00