Commit Graph

132 Commits

Author SHA1 Message Date
Shankar 67f352db69 fix(db): emit volume-state guidance on postgres auth failure (U-1, #10)
The shipped quickstart instructs operators to copy deploy/.env.example to
deploy/.env, edit POSTGRES_PASSWORD, and run docker compose up. On the
*first* boot of a fresh checkout this works. On the *second* boot — i.e.,
when an operator first booted with the default POSTGRES_PASSWORD=certctl,
then edited .env and re-ran up — the certctl-server container picks up the
new password (env interpolated at every container start) but postgres does
not. The postgres docker-entrypoint runs initdb only when the data dir is
empty; on subsequent boots the persistent named volume postgres_data is
non-empty so pg_authid retains the password baked in on first boot. The
server connects with the new credentials, postgres rejects them, and the
operator sees an opaque `pq: password authentication failed for user
"certctl"` in the server log with no pointer to the actual cause. New-
operator onboarding gets blocked on the documented production path.

Why a doc fix alone is not sufficient. Operators don't reread the docs
after a successful first boot — the trap fires on the *second* up, when
they think they've already learned the system. The opaque pq error is
indistinguishable in the log from a typo'd password or a misconfigured
secret store. The diagnostic has to fire at the moment the failure is
observed.

Why we don't try to fix the bootstrap. The env-vs-pg_authid divergence is
intrinsic to how the official postgres image bootstraps (see
docker-entrypoint.sh: initdb runs only if PGDATA is empty). Switching to a
bind mount or ephemeral volume breaks the production path; switching to
POSTGRES_PASSWORD_FILE + ALTER ROLE adds operator surface without
eliminating the divergence. The ergonomic fix is to surface the failure
mode loudly, with both remediation paths, at the exact log line where it
becomes visible.

Two remediation paths, surfaced together. Destructive: `docker compose
-f deploy/docker-compose.yml down -v && up -d --build` — wipes the
postgres volume so initdb re-runs with the new env value. Use this on
demos / first-time setup where data loss is acceptable. Non-destructive:
`docker compose exec postgres psql -U certctl -c "ALTER ROLE certctl
PASSWORD '<new>';"` followed by a server restart with the matching
POSTGRES_PASSWORD. Use this on any environment that holds data you want
to keep. Surfacing both means the operator can pick based on their
environment without us assuming.

Files changed:

- internal/repository/postgres/db.go — extract wrapPingError(err) helper.
  errors.As against *pq.Error; on SQLSTATE 28P01 (invalid_password) emit
  the multi-line guidance preserving the %w wrap chain. Non-28P01 errors
  retain the original `failed to ping database: %w` shape so transient
  connection-refused / timeout paths don't get noisy. Add
  pgErrInvalidPassword = "28P01" constant. Convert blank
  `_ "github.com/lib/pq"` import to direct import (driver registration
  still works via init()) so we can name the *pq.Error type at compile
  time. NewDB now calls wrapPingError(err) instead of inlining the wrap.
- internal/repository/postgres/db_test.go (new) — 4 internal-package
  unit tests covering wrapPingError. AuthFailureGuidance pins the
  contract substrings ("SQLSTATE 28P01", "POSTGRES_PASSWORD",
  "first boot", "down -v", "ALTER ROLE"). NonAuthErrorPreservesOriginalWrap
  pins the no-leak contract for SQLSTATE 08006 (connection_failure).
  NonPqErrorPreservesOriginalWrap pins the network-level path.
  NilReturnsNil pins defensive contract. All run in -short without
  testcontainers — package postgres (internal) so the unexported helper
  is callable directly.
- docs/quickstart.md — `> **Warning:**` callout immediately after the
  `cp deploy/.env.example deploy/.env` block at lines 56-61. Names the
  trap, names the SQLSTATE, gives both remediation paths. Uses the
  in-file `> **Note:**` blockquote convention.
- deploy/ENVIRONMENTS.md — `**Stateful volume — first-boot password
  binding (U-1)**` paragraph appended to the Postgres expert-note block.
  Explains the env-vs-pg_authid divergence, points at wrapPingError as
  the runtime diagnostic, lists both remediation paths. Uses the in-file
  `**Expert note:**` convention.

Out of scope (separate follow-ups):

- deploy/helm/certctl/templates/postgres-statefulset.yaml has the same
  root cause via PVC retention. The wrapPingError diagnostic covers the
  Helm path because the same NewDB code runs at server startup; the
  Helm-specific doc warning lands separately.
- /.env.example at repo root (line 16 hardcodes the password literally
  inside CERTCTL_DATABASE_URL rather than interpolating) — adjacent
  trap, separate fix.
- examples/{acme-nginx,private-ca-traefik,step-ca-haproxy,multi-issuer,
  acme-wildcard-dns01}/docker-compose.yml all carry the pattern. The
  diagnostic covers them; targeted doc warnings are scoped to the
  canonical quickstart + ENVIRONMENTS docs.

Out of consideration:

- Switch to bind mount / ephemeral volume — breaks the production path.
- POSTGRES_PASSWORD_FILE + Docker secret + ALTER ROLE rotation — adds
  operator surface without fixing the env-vs-pg_authid divergence.

Verification (all passing):
- go build ./...
- go vet ./...
- go test -short -race ./internal/repository/postgres/ — 4/4 new tests
  pass plus existing tests
- go test -short ./... — every package green
- govulncheck ./... — no vulnerabilities in our code
- wrapPingError coverage 100%; postgres pkg total unchanged in shape
  (NewDB/RunMigrations were 0% pre-fix, still 0% post-fix; new helper
  adds 100%-covered statements)

Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
      §2 P1 cluster, cat-u-quickstart_postgres_password_volume_trap
      GitHub Issue #10 (mikeakasully)
2026-04-24 23:21:26 +00:00
certctl dev 099683daba v2.0.48: swap self-signed TLS bootstrap algorithm ed25519 → ECDSA-P256
Follow-up to v2.0.47 (HTTPS-Everywhere). The Phase-3 self-signed
bootstrap sidecar shipped an ed25519 server cert. Apple's TLS stack —
Safari Network Framework and the macOS-bundled LibreSSL 3.3.6
/usr/bin/curl — does not advertise ed25519 in the ClientHello
signature_algorithms extension for server certs, so the handshake fails
with the server-side log line:

  tls: peer doesn't support any of the certificate's signature algorithms

Homebrew OpenSSL 3.x, Chrome, Firefox, and Linux curl all accept
ed25519 server certs fine. Apple is the outlier. Rather than gate the
demo stack behind "install Homebrew OpenSSL first," swap the bootstrap
algorithm to ECDSA-P256 with SHA-256 — universally supported, including
on the Apple stack.

Changes
- deploy/docker-compose.yml: certctl-tls-init openssl invocation swapped
  to `-newkey ec -pkeyopt ec_paramgen_curve:P-256 -nodes`; header comment
  + echo line updated; multi-line rationale paragraph added.
- deploy/docker-compose.test.yml: same openssl swap + echo update for
  the test harness sidecar that writes to the bind-mounted ./test/certs
  directory the Go integration_test.go pins via CERTCTL_TEST_CA_BUNDLE.
- docs/tls.md: Pattern 1 description + code block updated;
  "Why ECDSA-P256 and not ed25519" rationale paragraph added covering
  pre-v2.0.48 history, the Apple diagnosis, accepting clients, and
  the operator migration command. Patterns 2 (existing Secret) and 3
  (cert-manager) explicitly called out as unaffected.
- docs/upgrade-to-tls.md: docker-compose procedure sentence updated
  with cross-reference to tls.md Pattern 1.
- docs/test-env.md: "Get the CA bundle for curl" sentence updated.

Migration
Existing demo installs must tear the `certs` named volume down to pick
up the new algorithm:

  docker compose -f deploy/docker-compose.yml down -v
  docker compose -f deploy/docker-compose.yml up -d --build

Not touched
- cmd/server/tls.go: algorithm-agnostic. TLS 1.3 min version with
  [X25519, P-256] curve preferences for key exchange is orthogonal to
  the server cert's signature algorithm. No Go code change needed.
- Helm chart: Patterns 2 and 3 operators supply their own cert; this
  patch does not affect them.
- Unrelated ed25519 uses (agent key algorithm detection, profile
  algorithm options, SSH key path examples, tlsprobe key metadata,
  cloud discovery key-algo display): all orthogonal to the server TLS
  bootstrap cert.

Incidental cleanup
- .gitignore: dropped dangling `strategy.md` entry (file doesn't exist
  in repo; entry was cruft).
2026-04-20 04:17:05 +00:00
Shankar 3155b9475f v2.0.47: HTTPS Everywhere — TLS-only control plane, agents/CLI/MCP
Breaking change release. Plaintext HTTP listener removed. The certctl
control plane now terminates TLS 1.3 on :8443 via
http.Server.ListenAndServeTLS. No CERTCTL_TLS_ENABLED=false escape
hatch. No dual-listener mode. One-step cutover per docs/upgrade-to-tls.md.

Server
- cmd/server/tls.go: certHolder with SIGHUP hot-reload + atomic cert
  swap, buildServerTLSConfig (TLS 1.3 min, GetCertificate callback),
  preflightServerTLS validation
- cmd/server/main.go: ListenAndServeTLS in place of ListenAndServe,
  watchSIGHUP wiring, cert/key path config threading
- tls_test.go: 418-line regression coverage of reload, preflight,
  callback behavior, SAN validation

Config
- CERTCTL_TLS_CERT_PATH / CERTCTL_TLS_KEY_PATH (required)
- Plaintext rejection: agents/CLI/MCP pre-flight-fail on http://
  URLs with a pointer to docs/upgrade-to-tls.md

Agents, CLI, MCP
- All three pre-flight-reject http:// URLs with fail-loud diagnostic
- CERTCTL_SERVER_CA_BUNDLE_PATH for private-CA trust
- CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY for dev-only bypass
  (loud warning on startup)
- install-agent.sh emits both vars as commented template lines

docker-compose
- certctl-tls-init sidecar generates SAN-valid self-signed cert into
  deploy/test/certs/ on first boot
- All demo-stack curls pin against ca.crt with --cacert

Helm chart
- Three TLS provisioning modes, exactly one required:
  - server.tls.existingSecret (operator-supplied)
  - server.tls.certManager.enabled (cert-manager integration)
  - server.tls.selfSigned.enabled (eval only — not for production)
- server-certificate.yaml template for cert-manager mode
- helm install without a TLS source fails at template render with
  a pointer to docs/tls.md

CI
- .github/workflows/ci.yml Helm Chart Validation step renders the
  chart in both existingSecret and cert-manager modes, plus an
  inverse guard-regression test that asserts helm template MUST
  refuse to render when no TLS source is configured. Previously
  the single `helm template` invocation hit the certctl.tls.required
  fail-loud guard and exit-1'd CI. Four invocations now: lint
  (existingSecret), template (existingSecret), template
  (cert-manager), template (no args — must fail).

Integration tests
- deploy/test/integration_test.go stands up the Compose stack over
  HTTPS, extracts the CA bundle, and exercises every certctl API
  over https://localhost:8443
- All 34 integration subtests green (per Phase 8 local CI-parity)

Documentation
- New: docs/tls.md (provisioning patterns, rotation, SIGHUP reload)
- New: docs/upgrade-to-tls.md (one-step cutover, no-downgrade
  warnings, fleet-roll sequencing)
- CHANGELOG.md: v2.2.0 "HTTPS Everywhere — The Irony" entry
  (file heading unchanged; release tag is v2.0.47)
- All curls in docs/, examples/, deploy/helm/ guides use
  https://localhost:8443 --cacert

Verification
- grep -rn "ListenAndServe[^T]" cmd/ internal/ → 0 hits
- grep -rn "\"http://" cmd/ internal/ → 2 benign hits (Caddy admin
  API default, SSRF doc comment) — zero certctl endpoints
- Tasks #197–#206 (Phases 0–8) all closed in the tracker

Files: 65 changed, 3489 insertions, 372 deletions (pre-CI-fix).
2026-04-20 03:43:10 +00:00
Shankar c38ba40200 docs: reconcile scheduler topology across sibling docs (7 → 12 loops)
Authoritative 12-loop table lives at docs/architecture.md:522-534 (committed via
the I-001/I-003/I-005 + M48/M50 milestone commits). This change brings six sibling
docs into parity with that table so every surface — user-facing features reference,
SOC 2 compliance mapping, connectors guide, advanced demo architecture diagram,
testing guide, and in-line architecture prose — reflects the same 8 always-on + 4
opt-in topology.

Touches:
- docs/architecture.md: 2 inline ordinal references (9th / 8th loop) replaced with
  descriptive names (opt-in cloud discovery / opt-in endpoint health), cross-linked
  to the authoritative table to prevent future ordinal rot.
- docs/features.md: metric row (7 → 12), inline reference to 9th loop, and full
  scheduler table expanded to include Always-on column + env vars + I-001/I-003/I-005
  refs.
- docs/compliance-soc2.md: background scheduler monitoring bullets expanded to list
  all 12 loops with env vars + I-series refs; table row updated with 8 always-on +
  4 opt-in summary.
- docs/connectors.md: three inline ordinals (7th/6th/9th loop) replaced with
  descriptive names, cross-linked to architecture.md.
- docs/demo-advanced.md: Mermaid SCHED node label updated from '7 background loops'
  to '12 background loops (8 always-on + 4 opt-in)'.
- docs/testing-guide.md: Test 20.1.1 header + grep pattern expanded to include
  job-retry / job-timeout / notification-retry / digest / endpoint-health /
  cloud-discovery loops; sign-off chart row label updated.

Pure documentation reconciliation. No code changes. Master HEAD pre-commit: 25131a3.
2026-04-20 02:51:34 +00:00
Shankar 25131a377d M-001/M-006: strip HTTP auth from EST/SCEP + fail-loud SCEP preflight
Closes CWE-306 (missing authentication for critical function) for SCEP
via a fail-loud startup gate, and aligns EST/SCEP HTTP dispatch with
their respective RFCs. CRL/OCSP remain unauthenticated under
.well-known/pki/* per RFC 5280 §5 / RFC 6960 / RFC 8615. Option (D):
no mTLS in this milestone.

- RFC 7030 §3.2.3 (EST auth is deployment-specific) and §4.1.1
  (/cacerts explicitly anonymous): EST paths served unauthenticated;
  CSR-signature + profile policy enforce identity inside ESTService.
- RFC 8894 §3.2: SCEP authenticates via the challengePassword
  PKCS#10 attribute (OID 1.2.840.113549.1.9.7), not an HTTP credential.
  HTTP dispatch is unauthenticated; preflightSCEPChallengePassword
  refuses to start when CERTCTL_SCEP_ENABLED=true without
  CERTCTL_SCEP_CHALLENGE_PASSWORD. SCEPService.PKCSReq enforces the
  same invariant defense-in-depth and compares with
  crypto/subtle.ConstantTimeCompare.

cmd/server/main.go:
- Extract buildFinalHandler(apiHandler, noAuthHandler, webDir,
  dashboardEnabled); route /.well-known/est/*, /scep, /scep/*,
  /.well-known/pki/crl/{id}, /.well-known/pki/ocsp/{id}/{serial},
  and health probes through noAuthHandler (RequestID +
  structuredLogger + Recovery only).
- Add preflightSCEPChallengePassword fail-loud gate; startup log
  emits challenge_password_set boolean for operator visibility.

cmd/server/finalhandler_test.go (new, 314 lines, 27 subtests):
- TestBuildFinalHandler_Dispatch (20) + TestBuildFinalHandler_NoDashboard
  (7) pin the dispatch surface: EST 4-endpoint, SCEP exact +
  trailing-slash + query-string, PKI CRL+OCSP, health, /api/v1/*
  authenticated, /assets/* file server, SPA fallback.

internal/api/router/router.go, internal/config/config.go:
- Router-level comments explain why EST/SCEP/PKI dispatchers sit
  outside the authenticated mux; SCEP challenge password config
  plumbed through.

docs/architecture.md:
- New EST Authentication subsection (RFC 7030 §3.2.3 + §4.1.1,
  buildFinalHandler + noAuthHandler references).
- Rewrite SCEP Authentication subsection; replaces pre-existing
  factually-incorrect "any value accepted" claim with CWE-306
  preflight, service-layer defense-in-depth, and
  crypto/subtle.ConstantTimeCompare.
- Top-level Authentication section: qualify /api/v1/* scope on API
  clients bullet; add standards-based-endpoints bullet referencing
  the 27-subtest regression harness.

docs/compliance-soc2.md:
- CC6.1: scope API Key Authentication to /api/v1/*; add
  standards-based endpoints bullet citing RFCs and CWE-306 closure.
- CC6.3: scope API Key Policy to /api/v1/* with cross-reference to
  CC6.1.
- Evidence Locations augmented with buildFinalHandler,
  preflightSCEPChallengePassword, scep.go defense path, regression
  harness, and OpenAPI security:[] overrides.

api/openapi.yaml: verified already correct (global bearerAuth
default overridden with security:[] on /cacerts, /simpleenroll,
/simplereenroll, /csrattrs, /scep GET+POST, /crl/{issuer_id},
/ocsp/{issuer_id}/{serial}); no edits needed.
2026-04-19 17:20:05 +00:00
Shankar 15daf008aa I-005: notification retry loop + dead-letter queue
Critical alerts can no longer be silently dropped by a transient
notifier failure. Failed notification attempts now ride an exponential
backoff retry loop, with a 5-attempt budget before promotion to the
dead-letter queue for operator intervention.

Schema (migration 000016, idempotent):
- retry_count INTEGER NOT NULL DEFAULT 0
- next_retry_at TIMESTAMPTZ
- last_error TEXT
- idx_notification_events_retry_sweep partial index
  (next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL
  Dead rows clear next_retry_at so the index stops matching them.

Service contract:
- NotificationService.RetryFailedNotifications drives 2^n-minute
  exponential backoff capped at 1h (notifRetryBackoffCap) with
  5-attempt budget (notifRetryMaxAttempts).
- Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to
  status='dead' via MarkAsDead.
- Non-terminal failures record via RecordFailedAttempt.
- Success path promotes to 'sent' without touching retry_count
  (audit preserves "delivered on attempt N").
- Missing-notifier branch defensively promotes to 'sent' to avoid
  wedging a row on a deleted channel.
- RequeueNotification operator escape hatch atomically resets
  retry_count -> 0, next_retry_at -> NULL, last_error -> NULL,
  status -> pending via notifRepo.Requeue.

Scheduler:
- New always-on notificationRetryLoop wired into the base loop set at
  CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m).
- sync/atomic.Bool idempotency guard.
- sync.WaitGroup shutdown drain via WaitForCompletion.

StatsService:
- SetNotifRepo setter pattern preserves 9 pre-existing
  NewStatsService call sites (main.go + stats_test.go + 8 digest
  tests) without touching the constructor signature.
- DashboardSummary.NotificationsDead populated via
  notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired
  (reports zero on systems without a notification repository).
- CountByStatus error is non-fatal (dashboard summary is
  best-effort for this field).
- Prometheus certctl_notification_dead_total counter emitted from
  the same snapshot.

Handler:
- New POST /api/v1/notifications/{id}/requeue endpoint.
- dead status surfaces to MCP + CLI.

Frontend:
- NotificationsPage gains two-tab toolbar ("All" / "Dead letter")
  with queryKey: ['notifications', activeTab] so switching tabs
  doesn't serve stale data until the 30s refetch.
- Dead rows surface "Retry {n}/5" + truncated last_error with
  full-text title tooltip.
- Requeue mutation wrapped as
    mutationFn: (id: string) => requeueNotification(id)
  to prevent react-query v5's positional context argument from
  leaking into the API client — pinned against future refactors
  by strict-match toHaveBeenCalledWith('notif-dead-001') in
  NotificationsPage.test.tsx:181.

Closes I-005.
2026-04-19 15:17:27 +00:00
Shankar Reddy 49002c8cba Close I-004 (agent hard-delete cascades targets) coverage-gap finding
Operator decision answered as full soft-delete with optional forced
cascade — hard-delete is not reachable from any public surface. Prior
to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents`
whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id
silently wiped every target, orphaning certs and aborting in-flight
jobs. The finding closure reshapes the agent-removal contract around
soft retirement with explicit preflight counts, an opt-in cascade
gated by a mandatory reason, and unconditional protection for the
four reserved sentinel agents used by discovery sources.

Schema — migration 000015:
  migrations/000015_agent_retire.up.sql flips
  deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE
  RESTRICT, so a stray `DELETE FROM agents` now errors at the DB
  boundary instead of quietly destroying targets. Both `agents` and
  `deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason
  TEXT pair (TEXT not VARCHAR so operator comments are never
  truncated), indexed via partial indexes WHERE retired_at IS NOT
  NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP
  CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT
  EXISTS) so repeated runs against partially-migrated databases
  converge. migrations/000015_agent_retire.down.sql restores CASCADE
  and drops the new columns for clean rollback. A dedicated
  repository-layer testcontainers test
  (internal/repository/postgres/migration_000015_test.go) asserts the
  before/after FK action, column presence, index presence, and
  round-trip idempotency under up→down→up.

Domain — sentinel guard + dependency counts:
  internal/domain/connector.go gains IsRetired() on Agent, the
  exported SentinelAgentIDs slice listing server-scanner,
  cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the
  four reserved IDs documented in CLAUDE.md and created at startup in
  cmd/server/main.go), IsSentinelAgent(id string) predicate,
  AgentDependencyCounts{ActiveTargets, ActiveCertificates,
  PendingJobs} with a HasDependencies() method, and ActorTypeAgent /
  ActorTypeSystem enum values used by audit emission downstream.
  Coverage locked down by internal/domain/connector_test.go.

Service — 8-step ordered contract:
  internal/service/agent_retire.go:RetireAgent(ctx, id, actor,
  opts{Force, Reason}) enforces a fixed execution order:
  (1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel
      unconditionally; force=true does NOT bypass it.
  (2) fetch — ErrAgentNotFound on miss.
  (3) idempotency — if IsRetired() already, return
      AgentRetirementResult{AlreadyRetired: true} with no new audit
      event and no state change (safe to replay from flaky clients).
  (4) preflight counts — collectAgentDependencyCounts runs
      ActiveTargets, ActiveCertificates, PendingJobs sequentially
      (not in parallel; keeps the per-query timeout predictable and
      matches the repo's existing call-chain shape).
  (5) force-reason guard — opts.Force=true with empty Reason returns
      ErrForceReasonRequired (wired into the 400 status surface).
  (6) dependency guard — HasDependencies() with opts.Force=false
      returns BlockedByDependenciesError{Counts} (wired into the 409
      body with per-bucket counts).
  (7) mutation — single pinned retiredAt := time.Now(); agent
      retirement first, then cascade target retirement if opts.Force,
      all under the repo's single transaction so the two retired_at
      stamps match to the second.
  (8) best-effort audit — agent_retired always; agent_retirement_
      cascaded additionally on the force path. Actor is whatever the
      handler resolves from the request; actor type is mapped by
      resolveActorType (system/agent-prefix→Agent/else→User). Audit
      emission failures are logged via slog.Error but do not abort
      the retirement (matches the house convention used by every
      other scheduler-emitted event).

  BlockedByDependenciesError implements Error() as
  "active_targets=%d, active_certificates=%d, pending_jobs=%d" and
  Unwrap() → ErrBlockedByDependencies. The single struct satisfies
  errors.Is via Unwrap (used by scheduler-level tests) and errors.As
  via the concrete type (used by the handler to fish out Counts for
  the 409 body). ListRetiredAgents(page, perPage) adds a separate
  paginated accessor with page<1→1 and perPage<1→50 normalization so
  retired rows are queryable without polluting the default agent
  listing.

  Sentinel guard coverage is asymmetric by design: all four reserved
  IDs are protected, and force=true cannot override. Regression tests
  in internal/service/agent_retire_test.go assert each of the eight
  steps in order, plus sentinel bypass attempts and idempotency
  replay.

Handler + router — status-code surface:
  internal/api/handler/agents.go:RetireAgent exposes seven status
  codes on DELETE /agents/{id}:
    200 on a fresh retirement (body echoes AgentRetirementResult).
    204 on idempotent replay (AlreadyRetired=true; no new audit).
    400 on ErrForceReasonRequired.
    403 on ErrAgentIsSentinel.
    404 on ErrAgentNotFound.
    409 on BlockedByDependenciesError, with a custom body shape
        {error, counts{active_targets, active_certificates,
        pending_jobs}} that bypasses the default ErrorWithRequestID
        envelope so callers get the per-bucket numbers directly.
    500 on any other error.
  Heartbeat HandleHeartbeat returns 410 Gone when the agent is
  retired (ErrAgentRetired), signalling the agent to shut down.
  Query params `force=true` and `reason=<text>` drive the cascade
  path; both are forwarded as url.Values through the new MCP
  transport.

  internal/api/router/router.go registers GET /api/v1/agents/retired
  literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's
  literal-beats-pattern-var precedence routes "retired" to the
  paginated retired-agents listing instead of fetching a hypothetical
  agent named "retired".

Agent binary — clean shutdown on 410:
  cmd/agent/main.go gains the ErrAgentRetired sentinel, a
  retiredOnce sync.Once, and a retiredSignal chan struct{}. A
  markRetired(source, statusCode, body) helper closes the channel
  exactly once; the Run() select loop observes the close and returns
  ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired)
  and exits cleanly instead of spinning in the heartbeat retry loop.
  The 410 Gone surface is therefore terminal for the agent process.

MCP transport:
  internal/mcp/client.go adds Client.DeleteWithQuery(path, query),
  a new additive transport method. Client.Delete is path-only; without
  this method the retire tool would silently drop `force` and `reason`,
  turning every cascade retire into a default soft-retire. The new
  method shares do()'s 204 normalization and 4xx/5xx error
  propagation so tool authors get one contract.
  internal/mcp/tools.go + internal/mcp/types.go expose the
  retire_agent tool with Force+Reason inputs wired through
  DeleteWithQuery.

CLI:
  cmd/cli/main.go + internal/cli/client.go add two CLI surfaces:
  `agents list --retired` (client-side strip of --retired then
  delegation to ListRetiredAgents, sharing --page/--per-page parsing
  with the default listing) and `agents retire <id> [--force --reason
  "…"]` (mirrors ErrForceReasonRequired — force without reason is
  rejected client-side before the request is sent). JSON + table
  output modes both honor the new columns.

Frontend:
  web/src/pages/AgentsPage.tsx surfaces retired/retire affordances.
  web/src/api/client.ts + web/src/api/types.ts expose the retire
  endpoint and the retired-listing. 4 new Vitest regression cases.

OpenAPI:
  api/openapi.yaml documents DELETE /agents/{id} with all seven
  status codes, 410 on heartbeat, and the 409 per-bucket body shape.

Regression coverage (six new test files, all green):
  internal/service/agent_retire_test.go           — 8-step contract + sentinel guards
  internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
  internal/mcp/retire_agent_test.go               — DeleteWithQuery wire-through
  internal/cli/agent_retire_test.go               — --retired listing + --force/--reason pairing
  internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down
  internal/domain/connector_test.go               — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies

Files:
  api/openapi.yaml                                — DELETE + 410 + 409 body shape
  cmd/agent/main.go                               — ErrAgentRetired, markRetired, retiredSignal
  cmd/cli/main.go                                 — handleAgents list/get/retire dispatch
  docs/architecture.md, docs/concepts.md,
    docs/testing-guide.md                         — retirement contract narrative
  internal/api/handler/agents.go                  — RetireAgent, status surface, 410 on heartbeat
  internal/api/handler/agent_handler_test.go      — extended coverage
  internal/api/handler/agent_retire_handler_test.go — new
  internal/api/router/router.go                   — /agents/retired before /agents/{id}
  internal/cli/agent_retire_test.go               — new
  internal/cli/client.go                          — ListRetiredAgents + RetireAgent
  internal/domain/connector.go                    — IsRetired, SentinelAgentIDs,
                                                    IsSentinelAgent, AgentDependencyCounts,
                                                    ActorTypeAgent/System
  internal/domain/connector_test.go               — new
  internal/integration/lifecycle_test.go          — retirement fixture
  internal/mcp/client.go                          — DeleteWithQuery additive transport
  internal/mcp/retire_agent_test.go               — new
  internal/mcp/tools.go, internal/mcp/types.go    — retire_agent tool + Force/Reason inputs
  internal/repository/interfaces.go               — AgentRepository retirement methods
  internal/repository/postgres/agent.go           — retire + cascade target retire + counts
  internal/repository/postgres/migration_000015_test.go — new
  internal/service/agent.go                       — wire into AgentService surface
  internal/service/agent_retire.go                — new 8-step contract
  internal/service/agent_retire_test.go           — new
  internal/service/deployment.go                  — skip retired agents
  internal/service/target.go                      — skip retired agents
  internal/service/testutil_test.go               — shared mocks extended
  migrations/000015_agent_retire.up.sql           — new
  migrations/000015_agent_retire.down.sql         — new
  web/src/api/client.ts, types.ts + tests         — retire endpoint wiring
  web/src/pages/AgentsPage.tsx                    — retire UI
2026-04-19 05:24:00 +00:00
Shankar Reddy 45361477ed Unify API auth + RFC-compliant CRL/OCSP (M-002 + M-003 + M-006, auto-closes M-001)
Closes the remaining P1 gaps from coverage-gap-audit.md (M-001/M-002/M-003/M-006)
on top of the C-001/C-002 ownership + agent-FK contract fixes landed in
5c01c7f. The work lands as a single commit spanning server, docs, tests,
and the React client.

M-002 — Named API keys with per-key actor propagation
  * Migration 000014 adds the 'api_keys' table (id, name, hash,
    principal, role, created_at, last_used_at, disabled_at) so every
    credential carries an identifiable principal instead of the
    opaque 'anonymous'/'api-key' sentinel.
  * Auth middleware now rotates through configured keys, performs
    constant-time hash comparison, stamps 'last_used_at', and emits
    an actor struct via contextWithActor(). The audit middleware,
    bulk-revocation handler, approval handlers, and MCP tool layer
    now read the principal off the context and persist it on every
    audit_events row.
  * Regression coverage:
      - internal/api/middleware/audit_test.go — actor propagation,
        principal redaction for disabled keys, anonymous fallback for
        unauthenticated endpoints.
      - internal/api/handler/bulk_revocation_handler_test.go,
        job_handler_test.go — principal-on-audit assertions.

M-003 — Authorization gates (Phase B)
  * Approval handler rejects self-approval / self-rejection with 403
    when the actor principal equals the job's requested_by field.
  * Bulk revocation is gated behind the 'admin' role; operators and
    viewers receive 403.
  * Regression coverage:
      - internal/service/job_test.go — TestApproveJob_NotSelf,
        TestRejectJob_NotSelf.
      - internal/api/handler/bulk_revocation_handler_test.go —
        TestBulkRevoke_RequiresAdmin, TestBulkRevoke_AdminSucceeds.

M-006 — RFC-compliant CRL/OCSP on the unauthenticated .well-known mux
  * Per RFC 8615, relying parties cannot reasonably be asked to
    authenticate against the issuing certctl instance to retrieve
    revocation material. CRL and OCSP move off the authenticated
    '/api/v1/crl*' and '/api/v1/ocsp/*' paths onto:
        GET /.well-known/pki/crl/{issuer_id}
            Content-Type: application/pkix-crl   (RFC 5280 §5)
        GET /.well-known/pki/ocsp/{issuer_id}/{serial}
            Content-Type: application/ocsp-response  (RFC 6960)
  * Non-standard JSON CRL shape is removed; only DER is served.
  * Short-lived certificate exemption (profile TTL < 1h → skip
    CRL/OCSP) is preserved; the response simply omits the serial.
  * Routes are registered on the unauthenticated 'finalHandler' mux
    in cmd/server/main.go alongside EST ('/.well-known/est/*') and
    SCEP ('/scep'). Legacy authenticated paths return 404.
  * Regression coverage:
      - internal/api/handler/certificate_handler_test.go — content
        type, DER parseability, 404 for unknown issuer.
      - internal/api/handler/adversarial_path_test.go — unauthenticated
        access asserted for CRL, OCSP, EST, SCEP.
      - internal/api/router/router_test.go — route-table assertion
        that '.well-known/pki/*', '.well-known/est/*', and '/scep' are
        mounted on the unauthenticated branch.

M-001 — Auto-closed by M-002
  EST and SCEP were already registered on the unauthenticated
  'finalHandler' mux; the router comment at
  internal/api/router/router.go:247 now matches reality. The
  adversarial-path tests above lock the behavior in.

Verification (all gates green):
  * go vet ./...                                           — clean
  * go build ./...                                         — ok
  * go test -short ./... (55+ packages)                    — all pass
  * web/ : npm test (225 Vitest tests)                     — all pass
  * web/ : npx tsc --noEmit                                — clean
  * grep sweep for '/api/v1/(crl|ocsp)' — 13 surviving hits,
    all intentional M-006 tombstone/relocation comments.

Documentation:
  * coverage-gap-audit.md — status flips M-001/M-002/M-003/M-006 →
    Fixed, with per-finding resolution paragraphs citing regression
    test IDs. (Audit file lives outside this repo; see cowork root.)
  * CLAUDE.md Project Status line updated with the auth-unification
    closure note.
  * docs/features.md, docs/architecture.md, docs/quickstart.md,
    docs/concepts.md, docs/connectors.md, docs/test-env.md,
    docs/testing-guide.md, docs/compliance-*.md, docs/demo-advanced.md
    — refreshed for the new '.well-known/pki/*' namespace and named
    API keys.
  * api/openapi.yaml — documents the new unauthenticated endpoints
    and removes the legacy '/api/v1/crl*' + '/api/v1/ocsp/*' paths.

.gitignore: adds '/.gocache/' and '/.gomodcache/' for the session-
scoped Go caches so they never enter the tree.
2026-04-18 18:17:41 +00:00
Shankar Reddy 76d383bd64 fix(crypto): per-ciphertext PBKDF2 salt + v2 versioned format with v1 fallback (M-8) 2026-04-17 05:36:29 +00:00
Shankar 25564021e8 security(globalsign): remove InsecureSkipVerify and pin CA pool (H-5)
The GlobalSign Atlas HVCA connector previously used InsecureSkipVerify:true
on its mTLS TLS config, disabling server certificate validation and
defeating the purpose of the client-side mTLS handshake. This was a
CWE-295 Improper Certificate Validation vulnerability silently degrading
trust on every production call to GlobalSign's signing API.

Remediation (per H-5 audit finding, Lens 4.4):

- Remove InsecureSkipVerify from all three http.Client construction sites
  (ValidateConfig, getHTTPClient, and legacy initialisation path).
- Introduce buildServerTLSConfig() helper that constructs tls.Config with
  MinVersion: tls.VersionTLS12 (addresses adjacent L-1 recommendation).
- New optional config field `server_ca_path` (env:
  CERTCTL_GLOBALSIGN_SERVER_CA_PATH). When unset the connector trusts the
  system root CA bundle (correct default for GlobalSign's publicly-trusted
  HVCA endpoints). When set the bundle is loaded via x509.NewCertPool() +
  AppendCertsFromPEM, and only those roots are trusted (supports private
  HVCA deployments and defence-in-depth root pinning).
- Error wrapping chain: "failed to read server CA bundle at %s" and
  "no valid PEM certificates found in server CA bundle at %s" surface
  config problems at ValidateConfig time instead of silently failing at
  request time.

Docs, config, service env-seed, and GUI issuer type definition updated to
expose the new field. Tests: 9 dead `InsecureSkipVerify: true` client
TLSClientConfig blocks (no-ops against httptest.NewServer plain-HTTP)
replaced with bare http.Client; new TestGlobalSign_ServerTLSConfig covers
pinned-CA trust, untrusted-server rejection, missing-file and invalid-PEM
error paths.

Verification:
- go build ./... clean
- go vet ./... clean
- go test -race ./internal/connector/issuer/globalsign/... ./internal/config/... ./internal/service/... ok
- go test ./... (excluding testcontainers-gated repo layer) ok
- golangci-lint run ./... 0 issues
- govulncheck ./... 0 reachable vulns
- Per-layer coverage: service 68.7% (≥55), handler 83.6% (≥60), domain 82.0% (≥40), middleware 63.8% (≥30)
- globalsign package coverage: 75.9%
- Invariant sweep: 0 InsecureSkipVerify references remain in globalsign
  package (only a test-file comment documenting the removal).
2026-04-17 01:40:58 +00:00
Shankar 9513e513e0 fix: correct BSL 1.1 change date to March 14, 2033
why-certctl.md said March 1, CHART_SUMMARY.md said March 28. The
LICENSE file is authoritative: Change Date is March 14, 2033.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 11:12:49 -04:00
Shankar 4e3927e8b4 feat(V2.2): bulk revocation — filter-based fleet-wide certificate revocation
Add POST /api/v1/certificates/bulk-revoke with filter criteria (profile_id,
owner_id, agent_id, issuer_id, team_id, certificate_ids), partial-failure
tolerance, and audit trail. Includes MCP tool, CLI command (certs bulk-revoke),
server-side bulk modal in GUI replacing client-side sequential loop, OpenAPI
spec, compliance mapping updates, and 21 new tests (12 service, 7 handler,
1 CLI, 1 frontend).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 00:06:34 -04:00
Shankar e1630bcb44 feat(M50): cloud secret manager discovery — AWS SM, Azure KV, GCP SM
Extend certificate discovery from filesystem + network to cloud secret
managers. Three pluggable DiscoverySource connectors feed into the
existing discovery pipeline via sentinel agent pattern, with a 9th
scheduler loop for periodic cloud scanning.

- AWS Secrets Manager: aws-sdk-go-v2, tag/prefix filtering, 10 tests
- Azure Key Vault: stdlib HTTP + OAuth2, base64 DER/PEM, 16 tests
- GCP Secret Manager: stdlib HTTP + JWT OAuth2, label filter, 14 tests
- CloudDiscoveryService orchestrator with 9 tests
- 9th scheduler loop (6h default, atomic.Bool idempotency)
- Discovery page: color-coded source type badges
- 14 new env vars across CloudDiscoveryConfig structs
- Docs: connectors.md, architecture.md, features.md, README updated

49 new tests. All CI checks pass (go vet, race, lint, coverage).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 23:01:00 -04:00
Shankar dd79096b70 feat(M49): Entrust, GlobalSign & EJBCA issuer connectors
Add three new issuer connectors completing commercial and open-source CA
coverage. Entrust uses mTLS client certificate auth with sync/async
issuance. GlobalSign Atlas uses mTLS + API key/secret dual auth with
serial-based tracking. EJBCA supports dual auth (mTLS or OAuth2) for
self-hosted Keyfactor CAs.

Each connector implements the full issuer.Connector interface (9 methods),
includes httptest-based unit tests (~14 each), and follows established
patterns (injectable HTTP clients, RFC 5280 revocation reason mapping,
CRL/OCSP delegated to CA).

Also includes: issuer factory cases, env var seeding, config structs,
domain types, seed data (3 rows, all disabled), OpenAPI enum updates,
frontend issuer catalog entries with config fields, and full docs
(connectors.md, architecture.md, features.md, README).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 22:24:12 -04:00
Shankar de82de953b feat(M48): continuous TLS health monitoring — endpoint state machine, shared tlsprobe, 8 API endpoints, GUI
Adds continuous TLS endpoint health monitoring that closes the deploy→verify→monitor loop.
After M25 verifies a deployment succeeded once, M48 continuously confirms it stays healthy.

Key components:
- Shared `internal/tlsprobe/` package extracted from network scanner for reuse
- Health status state machine: healthy → degraded (2 failures) → down (5 failures),
  plus cert_mismatch when served fingerprint differs from expected
- 8th scheduler loop (60s tick, per-endpoint configurable intervals)
- PostgreSQL migration 000011: endpoint_health_checks + endpoint_health_history tables
- 8 REST API endpoints (CRUD, history, acknowledge, summary)
- Health Monitor GUI page with summary bar, status table, create modal, auto-refresh
- 38 new tests (5 tlsprobe + 11 domain + 10 service + 8 handler + 4 frontend)
- All coverage thresholds maintained (service 68%, handler 83%, domain 87%, middleware 63%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 21:45:45 -04:00
Shankar ff223e2586 feat(M11c): crypto policy enforcement — CSR validation, MaxTTL caps, key metadata
Enforce certificate profile crypto constraints across all 5 issuance paths
(renewal, agent CSR, EST, SCEP). ValidateCSRAgainstProfile() rejects CSRs
with key algorithm/size that don't match profile rules. MaxTTL enforcement
caps certificate validity per issuer connector (Local CA, Vault, step-ca
enforce directly; ACME/DigiCert/Sectigo pass through). Key algorithm and
size are now persisted in certificate_versions for audit compliance.

16 new tests (12 service-layer + 4 Local CA connector). Removes hardcoded
version number from GUI sidebar. Documentation updated across architecture,
features, connectors, and README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 21:05:14 -04:00
Shankar 98bb57e6b4 feat(M51): add SCEP server (RFC 8894) for MDM and network device enrollment
Implements Simple Certificate Enrollment Protocol with single-endpoint
operation-based dispatch (GetCACaps, GetCACert, PKIOperation), PKCS#7
SignedData CSR extraction with fallback for raw/base64 CSR, challenge
password authentication via CSR attributes, and shared internal/pkcs7
package extracted from EST handler to eliminate code duplication.

24 new tests (11 service + 13 handler) plus 5 shared pkcs7 package tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 16:47:18 -04:00
Shankar 8a27557773 tighten BSL license scope, fix documentation underselling shipped features
Broadened BSL Additional Use Grant from "hosted or managed service" to cover
any commercial offering (embedded, bundled, integrated). Updated README to
promote all shipped connectors from Beta to Implemented, added EST/ARI/S/MIME
highlight, Helm quickstart, and corrected license description. Fixed
connectors.md stale claims (AWS ACM PCA listed as planned, K8s Secrets
listed as coming soon) and updated overview with exact connector counts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 15:54:03 -04:00
Shankar 28d8771d8c docs: rewrite features.md, audit README + architecture against repo
Rewrote docs/features.md from scratch as authoritative feature inventory
(1255 lines, every claim verified against source files).

Audited README.md and architecture.md against repo — fixed 19 stale
references: K8s Secrets status, issuer counts, dashboard page counts,
CI thresholds, missing connectors in Mermaid diagrams, OpenAPI operation
count, GetCACertPEM behavior, and V2/V4 roadmap accuracy.

Also includes related fixes discovered during audit:
- Scheduler skips expired/failed/revoked certs from auto-renewal
- Seed demo expiry dates moved outside 31-day scheduler query window
- Agent pages use correct last_heartbeat_at field name

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-15 00:22:57 -04:00
Shankar cabe8d33eb fix: correct K8s Secrets status to 'Coming in 2.1', increase audit trail page size to 200
The Kubernetes Secrets target connector has config validation, tests, UI,
and Helm RBAC implemented but the realK8sClient is a stub — runtime
deployment will fail. Update README and connectors.md to reflect actual
status instead of misleading 'Beta' label.

Also increase the audit trail GUI default from 50 to 200 events per page
(backend already permits up to 500).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 12:11:01 -04:00
Shankar e72f06f35b feat(M47): add Kubernetes Secrets target + AWS ACM PCA issuer connectors
Implement both M47 connectors with full cross-layer wiring:

Kubernetes Secrets target: DNS-1123 validation, kubernetes.io/tls Secret
create-or-update, chain concatenation, serial number validation, Helm
RBAC gating. 18 tests.

AWS ACM Private CA issuer: synchronous issuance (like Vault), ARN regex
validation, RFC 5280 revocation reason mapping, CA cert retrieval,
factory + env var seeding. 23 tests.

Cross-cutting: domain types, service validation, config, factory, agent
dispatch, frontend (TargetsPage, issuerTypes), OpenAPI, seed data, Helm
chart, connectors docs, README. Testing docs (testing-guide, qa-test-guide,
qa_test.go) with Parts thematically integrated near related connectors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-07 20:21:09 -04:00
Shankar f17027c62b test: add unified QA test suite (qa_test.go) replacing legacy bash smoke script
1717-line Go test file covering all 52 Parts of testing-guide.md against the
Docker Compose demo stack. ~120 automated subtests (API, DB, source, perf),
11 skipped Parts with reasons, ~270 manual gaps documented. Audited against
actual router, seed data, domain structs, and migrations — 8 factual bugs
caught and fixed during review. Companion guide at docs/qa-test-guide.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-06 07:35:38 -04:00
Shankar 128601f545 docs: comprehensive testing guide audit — expand thin Parts, add 11 new connector/feature test sections
Refactored testing-guide.md from V2.0 (42 Parts, 444 tests) to V2.1 (52 Parts, 507 tests):

- Expanded Part 11 (ARI) and Part 19 (Agent Work Routing) with What/Why intro
  paragraphs and per-test annotations explaining the production impact
- Replaced Part 40 (Documentation) passive table with 8 executable verification
  tests (README screenshots, issuer/target type matching, OpenAPI parity, etc.)
- Added Part 39 benchmark tests for Prometheus endpoint and audit trail queries
- Added 11 new Part sections (42-52) covering all previously untested features:
  Envoy, Postfix/Dovecot, SSH, WinCertStore, JavaKeystore, Digest Email,
  Dynamic Issuer/Target Config, Onboarding Wizard, ACME Profiles, Helm Chart
- Fixed stale TOC entries (regenerated from actual headings)
- Removed duplicate TOC block left from previous reorder
- Added sign-off chart entries for all new Parts
- Updated summary: 144 auto (passed) + 88 auto (pending) + 5 skipped + 270 manual = 507 total

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-06 00:43:05 -04:00
Shankar bf947d5e34 chore: move HSM/TPM to V3 paid tier, rename roadmap.md to strategy.md
- HSM/TPM agent key storage and CA key storage moved from V5+ to V3 Pro
  (enterprise compliance gate, not adoption driver)
- Renamed roadmap.md to strategy.md (gitignored, never committed)
- Updated compliance-nist.md HSM references from V5 to V3 Pro

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-05 23:09:55 -04:00
Shankar 8fe53b7d2d docs: add Docker Compose environments guide and fix compose files
- New deploy/ENVIRONMENTS.md: comprehensive walkthrough of all 4 compose
  files with service-by-service explanations, beginner-friendly Docker
  concepts, and expert-level networking/config details
- Fix docker-compose.dev.yml: agent LOG_LEVEL → CERTCTL_LOG_LEVEL (was
  silently ignored without the CERTCTL_ prefix)
- Add CERTCTL_CONFIG_ENCRYPTION_KEY to base and test compose (enables
  M34/M35 dynamic issuer/target config encryption)
- Add CERTCTL_DISCOVERY_DIRS to base compose agent (enables filesystem
  certificate discovery in default deployment)
- Cross-link ENVIRONMENTS.md from README doc table and quickstart.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-05 21:57:17 -04:00
Shankar c91067b021 docs: expand README documentation table and fix orphaned doc links
- README: Add 7 missing docs to documentation table (MCP server, OpenAPI
  guide, migration guides for certbot/acme.sh/cert-manager, test
  environment, testing guide). Fix connector reference description to
  remove stale counts. Link OpenAPI guide instead of raw YAML.
- architecture.md: Add cross-references to testing-guide.md and
  test-env.md from testing strategy section and What's Next links.
  These were the only two orphaned docs with zero inbound references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-05 21:37:47 -04:00
Shankar 6efdb40607 docs: comprehensive documentation audit — fix stale counts, V2/V3 matrix, connector status
- features.md: Fix Feature Matrix to correctly show all V2 Free features
  (F5/IIS/WinCertStore/JavaKeystore as Implemented, not Stub; Vault/DigiCert/
  Sectigo/GoogleCAS as V2 Free, not V3 Paid). Add missing shipped features
  (EST, verification, export, S/MIME, ARI, digest, Helm, onboarding). Update
  issuer count to 9, target count to 13.
- architecture.md: Fix F5/IIS from "interface only, implementation planned"
  to implemented. Add all 13 target connectors to built-in targets list.
- why-certctl.md: Add Sectigo and Google CAS to issuer list (7→9). Fix
  target count (10→13). Remove hardcoded endpoint/operation counts.
- connectors.md: Fix F5 BIG-IP TOC entry from "Interface Only" to
  "Implemented". Remove dead "Planned Issuers" TOC link.
- README.md: Remove competitor product names (CertKit, KeyTalk). Remove
  hardcoded dashboard page count. Remove hardcoded endpoint counts. Fix V4
  roadmap to remove already-shipped issuers (Sectigo, Google CAS).
- Remove hardcoded MCP tool counts (78/80) across 8 files (mcp.md,
  architecture.md, features.md, testing-guide.md, concepts.md, quickstart.md,
  demo-advanced.md, why-certctl.md). Replace with "REST API exposed via MCP"
  to avoid future drift.
- quickstart.md: Docker Compose environments table (from previous session).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-05 21:33:12 -04:00
Shankar e05424d188 feat(M46): Windows Certificate Store + Java Keystore target connectors, shared certutil package
Extract shared certutil helpers (CreatePFX, ParsePrivateKey, ComputeThumbprint,
GenerateRandomPassword, ParseCertificatePEM) from IIS connector for reuse.
Add WinCertStore connector (PowerShell Import-PfxCertificate, dual local/WinRM
mode, configurable store/location, expired cert cleanup) and JavaKeystore
connector (PEM→PKCS#12→keytool pipeline, JKS/PKCS12 support, shell injection
prevention, path traversal protection). 53 new tests, all passing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-05 19:14:32 -04:00
Shankar 104ded63ca feat(M45): ACME certificate profile selection, ARI RFC 9773 renumber, 45-day renewal positioning
Three related ACME ecosystem changes shipped as a single milestone:

1. ACME Certificate Profile Selection: Custom JWS-signed newOrder POST with
   `profile` field (e.g., `tlsserver`, `shortlived` for 6-day certs) bypassing
   acme.Client.AuthorizeOrder() since golang.org/x/crypto lacks profile support.
   ES256 JWS signing with kid mode, nonce management, directory discovery.
   Empty profile delegates to standard library path (zero behavior change).
   Configurable via CERTCTL_ACME_PROFILE env var. GUI: profile dropdown on
   ACME issuer config.

2. ARI RFC 9702 → 9773 Renumber: All 25+ references updated across Go source,
   docs, README, and examples. Zero remaining occurrences of RFC 9702.

3. 45-Day / Short-Lived Certificate Positioning: 5 domain tests validating
   renewal thresholds against SC-081v3 validity reduction timeline (200→100→47
   days) and Let's Encrypt 45-day/6-day profiles. ARI (RFC 9773) is the
   expected renewal path for 6-day shortlived certs.

New tests: 13 profile + 5 domain threshold + 1 frontend = 19 new tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-05 13:52:13 -04:00
Shankar 3cf75ffb73 feat(M38): SSH target connector for agentless deployment via SSH/SFTP
Adds a new target connector enabling certificate deployment to any
Linux/Unix server without installing the certctl agent binary. Uses the
proxy agent pattern — a single agent in the same network zone deploys
certs to remote servers over SSH/SFTP.

Key additions:
- SSH/SFTP connector with key auth (file/inline) + password auth
- Injectable SSHClient interface for cross-platform testing (25 tests)
- Shell injection prevention via validation.ValidateShellCommand()
- Configurable cert/key/chain paths with octal permissions
- GUI: 11 SSH config fields in target create wizard

Also fixes pre-existing frontend bug where all target type strings
(nginx, apache, etc.) were sent as lowercase but the backend expects
proper-case (NGINX, Apache, etc.), breaking GUI-created targets.
Adds missing TargetTypeSSH to validTargetTypes service map.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-05 12:36:01 -04:00
Shankar cacde5a0ce docs: remove hardcoded test counts from public-facing docs
Replace brittle test count numbers (1,554+, 1,088+, 211, etc.) with
descriptions of testing approach and CI-enforced coverage gates.
Counts go stale every milestone — coverage thresholds are machine-
verified and never drift.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-04 00:20:22 -04:00
Shankar 2a0285f498 feat(M40): F5 BIG-IP target connector via iControl REST
Replace 190-line stub with full iControl REST implementation (~580 lines).
Token auth with 401 auto-retry, file upload + crypto object install,
transaction-based atomic SSL profile updates, cleanup on failure.
Injectable F5Client interface for cross-platform testing. 32 tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 22:26:58 -04:00
Shankar 4cc91d0f28 feat(M44): Google CAS issuer connector
Google Cloud Certificate Authority Service integration via REST API
with OAuth2 service account auth (JWT→access token). Synchronous
issuance model, CA pool selection, mutex-guarded token caching,
revocation with RFC 5280 reason mapping. No Google SDK dependency —
all stdlib. 19 tests with httptest mock OAuth2 + CAS API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 21:25:34 -04:00
Shankar b0c7d57aa1 feat(M43): Sectigo SCM issuer connector
Implement Sectigo Certificate Manager REST API connector with async
order model (enroll → poll → collect PEM), 3-header auth, DV/OV/EV
support, collect-not-ready (400/-183) graceful handling, and RFC 5280
revocation reason mapping. 20 tests with httptest mock API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 21:01:14 -04:00
Shankar ae4093d5f9 docs: add deployment examples index and cross-link migration guides
Create docs/examples.md as the central entry point for all 5 turnkey
docker-compose scenarios with a decision matrix, per-example summaries,
and contextual migration guide links. Update quickstart.md to bridge
from demo to real deployment. Consolidate README docs table (10 rows
from 13). Fix Vault PKI "(planned)" in cert-manager guide.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 17:41:23 -04:00
Shankar b9e8630cd7 docs: fix stale references, seed data case bugs, and convert ASCII diagrams to Mermaid
Audit all docs and examples against current codebase state. Fix seed_demo.sql
domain constant casing (IssuerType, TargetType, AgentStatus) that would cause
agent dispatch failures. Fix example docker-compose health endpoints (/health
not /api/v1/health) and env var names (CERTCTL_DATABASE_URL). Update connector
counts, test numbers, and planned→implemented status across docs. Convert 3
ASCII flow diagrams to Mermaid.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 16:11:42 -04:00
Shankar b1178c177c docs: rewrite why-certctl positioning page
Fix stale competitive claims (IIS shipped in M39, target count now 10),
add 47-day operational math as forcing function, add credibility signals
(1554 tests, 97 API operations, CI pipeline), restructure competitive
comparisons by category for scannability, add "What Else Ships Free"
feature surface section, add "Who Should Look Elsewhere" disqualification,
move ownership message to opening paragraph.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 15:50:41 -04:00
Shankar 8dc78b7455 feat(M42): Postfix/Dovecot mail server target connector
Dual-mode TLS connector for mail servers — single package with mode
field selecting Postfix or Dovecot defaults. File-based cert/key
deployment with correct permissions (cert 0644, key 0600), optional
chain append, shell injection prevention, and configurable
reload/validate commands. 18 tests covering config validation,
deployment, and security. GUI wizard fields and OpenAPI enum updated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 01:46:15 -04:00
Shankar c47d83ccf5 feat(M41): Envoy target connector with SDS support
File-based deployment for Envoy service mesh — writes cert/key/chain
to watched directory with optional SDS JSON config for xDS bootstrap.
Path traversal prevention, configurable filenames, 15 tests passing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-03 01:23:35 -04:00
Shankar e555af9288 feat(M39): IIS WinRM proxy agent mode + front-to-back wiring
Complete the IIS target connector with dual-mode deployment:
- WinRM proxy agent mode via masterzen/winrm for remote Windows servers
- Base64 PFX transfer with try/finally cleanup on remote host
- GUI wizard updated with 13 IIS config fields including WinRM settings
- TargetDetailPage sensitive field redaction (password/secret/token/key)
- OpenAPI TargetType enum updated (added Traefik, Caddy)
- connectors.md fully documented with WinRM proxy config example
- 38 total IIS tests (10 new WinRM tests), all passing with race detection

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-02 20:53:20 -04:00
Shankar 8f1dd10eac feat(M39): IIS target connector + README overhaul
Implement full IIS target connector with PEM-to-PFX conversion via
go-pkcs12, PowerShell-based deployment (Import-PfxCertificate, IIS
binding management), SHA-1 thumbprint computation, and SNI support.
Injectable PowerShellExecutor interface enables cross-platform testing.
Regex-validated config fields prevent PowerShell injection. 28 tests.

Restructure README from 563 to 313 lines: outcome-focused feature
descriptions, "Who Is This For" persona section, examples promoted
above the fold, configuration/API/security reference moved to docs.
All numbers verified against repo (25 GUI pages, 97 OpenAPI ops,
CI thresholds service 55%/handler 60%/domain 40%/middleware 30%).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-02 20:27:27 -04:00
Shankar c145cedfd0 feat: S/MIME certificate support in integration tests + test env docs
Add S/MIME (emailProtection EKU) end-to-end test coverage:
- ValidateCommonName() now accepts email addresses for S/MIME certs
- S/MIME test profile (prof-test-smime) in seed data
- Phase 11 test: issuance, EKU, KeyUsage, email SAN verification
- EST config enabled in test Docker Compose
- Portable KeyUsage parsing (awk, works on BSD/GNU)
- Full test environment documentation (docs/test-env.md)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-02 18:32:57 -04:00
Shankar 8dc68381e7 feat: frontend audit fixes, README accuracy pass, doc updates
Frontend audit (10 categories): lifecycle fields in types, new API
functions (CRL, OCSP, deployments, updateIssuer/Target, getPolicy),
issuer/owner/profile filters on CertificatesPage, last_renewal_at
column, error_message column on JobsPage, full crypto policy UI on
ProfilesPage (key algorithms, EKUs, SAN patterns), key info + CA
badge on DiscoveryPage, edit modal on TargetDetailPage, tags field
on certificate creation, darwin→macOS mapping on AgentFleetPage.
211 Vitest tests passing.

README accuracy: test counts (1300+ Go, 211 frontend), page count
(24), demo data (32 certs, 7 issuers, 180 days), endpoint count
(97), MCP tools (80), CLI subcommands (10), moved shipped items
out of "Coming in v2.1.0".

Docs: architecture.md diagrams updated (Vault PKI, DigiCert,
Traefik, Caddy added), features.md Vault/DigiCert status updated.
Version bumped to v2.0.20. cli binary removed from git tracking.
Testing guide Part 41 added (12 auto + 9 manual tests).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 22:10:45 -04:00
Shankar 45531ebbba feat: add issuer catalog page with type discovery + fix cert creation defaults (M33)
Issuer Catalog (M33):
- Shared issuer type config (issuerTypes.ts) with 6 supported + 2 coming-soon types
- Composable wizard components (TypeSelector, ConfigForm, ConfigDetailModal)
- Catalog card layout with Connected/Available/Coming Soon badges
- VaultPKI and DigiCert added to create wizard with full config fields
- ACME EAB fields (eab_kid, eab_hmac with sensitive flag)
- Issuer type filter dropdown on configured issuers table
- Config detail modal replacing 60-char truncation
- IssuerDetailPage uses shared typeLabels/redactConfig, Edit button, enabled/disabled status
- StatusBadge extended with Enabled/Disabled styles
- 2 new frontend tests (VaultPKI + DigiCert create payload verification)

Bug fixes:
- CertificateService.CreateCertificate now defaults Status to Pending and Tags to
  empty map when not set (DB column DEFAULTs only apply when columns are omitted
  from INSERT, but our repo always includes all columns)
- CreateCertificate handler now logs actual error via slog.Error before returning
  generic 500, enabling root cause debugging

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 18:58:23 -04:00
Shankar 078ba5ab7b feat: add Vault PKI and DigiCert CertCentral issuer connectors (M32 + M37)
Vault PKI: synchronous issuance via /v1/{mount}/sign/{role}, token auth,
revocation, CA cert retrieval, 14 tests. DigiCert CertCentral: async order
model (submit → poll → download), X-DC-DEVKEY auth, OV/EV support, PEM
bundle parsing, 16 tests. Both conditionally registered based on env vars.
Includes OpenAPI enum updates, seed data, connector docs, architecture docs,
README badges, and testing guide sign-off (Parts 38 + 39, 12 automated
smoke test assertions all passing).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 17:19:46 -04:00
Shankar 23ac38bef6 feat(Pre-2.1.0-E): GUI completeness — 5 new pages, clickable nav, verification badges
Wire all remaining backend features to the frontend GUI:

New pages:
- DigestPage: preview digest HTML via iframe + send with confirmation
- ObservabilityPage: health status, metrics gauges, Prometheus config + live output
- JobDetailPage: full job details, verification section, timeline, audit events
- IssuerDetailPage: redacted config, test connection, issued certificates list
- TargetDetailPage: config, agent link, deployment history with verification

Existing page updates:
- JobsPage: clickable job IDs, verification column with VerificationBadge
- IssuersPage: clickable issuer names linking to detail page
- TargetsPage: clickable target names linking to detail page
- Sidebar: Digest and Observability nav items
- 5 new routes in main.tsx

API client: getJob, getIssuer, getTarget, getJobVerification, getPrometheusMetrics
Tests: 7 new Vitest tests (203 total), testing-guide Part 37 (17 manual tests)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 14:10:58 -04:00
Shankar 5096dc5d71 feat(M31): agent work routing — scope jobs to assigned agents
Deployment jobs now set agent_id from target→agent relationship at
creation time. GetPendingWork() uses ListPendingByAgentID() with a
3-way UNION query (direct match, legacy NULL fallback via target JOIN,
AwaitingCSR via cert→target→agent chain) so each agent only receives
its own jobs.

- Added AgentID *string to Job domain struct
- Added agent_id to all job SQL queries (5 SELECTs, INSERT, UPDATE, scanJob)
- New ListPendingByAgentID() repository method
- Rewrote GetPendingWork() from ~25 lines to single scoped query
- 4 new Go tests (3 agent routing + 1 deployment agent_id)
- Frontend: agent_id/target_id on Job type

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 14:10:42 -04:00
Shankar 2caf03c543 feat: wire ARI (RFC 9702) into renewal scheduler
CheckExpiringCertificates() now queries each issuer's ARI endpoint
before creating renewal jobs. If the CA says "not yet" (suggested
window hasn't opened), renewal is deferred. ARI errors fall back
gracefully to threshold-based logic. Audit trail records
renewal_trigger=ari when ARI drives the decision.

4 new unit tests: ShouldRenewNow, NotYet, NilFallback, ErrorFallback.
3 new smoke tests in testing-guide.md Part 35.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 12:11:42 -04:00
Shankar 320c8ae2ca fix(docs): correct migration guides — 17 issues found via repo audit
Fixes factual errors, broken links, wrong ports, inaccurate GUI
descriptions, and misleading config formats across all three migration
guides (certbot, acme.sh, cert-manager).

Key fixes:
- Correct server port from 8080/3000 to 8443 across all guides
- Fix HTTPS→HTTP for Docker Compose (not TLS-terminated)
- Fix heartbeat interval: 60 seconds, not 5 minutes
- Fix "50 servers" → "10 servers" (50 certs across 10 servers)
- Replace JSON config blocks with env var format (actual config method)
- Fix policy creation flow to match actual GUI (name/type/severity/config)
- Fix issuer wizard description to match actual 2-step flow
- Fix Vault PKI "coming in v2.1" → "planned" (ships post-2.1.0)
- Fix 5 broken links (cert-manager.md, quickstart anchors, architecture anchor)
- Remove claim of auto-generated suggestions in discovery flow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 01:34:22 -04:00
Shankar 380fcab42e fix: resolve NULL csr_pem scan errors and QA smoke test failures
Root cause: certificate_versions.csr_pem is nullable in the schema but
Go code scanned it into a plain string. Used sql.NullString in
ListVersions and GetLatestVersion to handle NULL values correctly.

Also includes: partial update fetch-merge-update pattern to prevent FK
violations, nil directory guard in discovery service, diagnostic slog
logging in handlers, export handler 422 for unparseable PEM, OpenAPI
spec corrections, MCP tool description improvements, and test fixes.

Rewrites the Release Sign-Off section in testing-guide.md to individual
test-level granularity (320 rows) with smoke test results audited and
checked off (121 pass, 5 skip, 194 manual remaining).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 00:51:18 -04:00